ETAPS Foreword


Welcome to the 24th ETAPS! ETAPS 2021 was originally planned to take place in 
Luxembourg in its beautiful capital Luxembourg City. Because of the Covid-19 pan- 
demic, this was changed to an online event. 

ETAPS 2021 was the 24th instance of the European Joint Conferences on Theory 
and Practice of Software. ETAPS is an annual federated conference established in 
1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organising these conferences in a coherent, 
highly synchronised conference programme enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops take place that 
attract many researchers from all over the globe. 

ETAPS 2021 received 260 submissions in total, 115 of which were accepted, 
yielding an overall acceptance rate of 44.2%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2021 featured the unifying invited speakers Scott Smolka (Stony Brook 
University) and Jane Hillston (University of Edinburgh) and the conference-specific 
invited speakers Isil Dillig (University of Texas at Austin) for ESOP and Willem Visser 
(Stellenbosch University) for FASE. Invited tutorials were provided by Erika Ábrahám
(RWTH Aachen University) on analysis of hybrid systems and Madhusudan
Parthasarathy (University of Illinois at Urbana-Champaign) on combining machine
learning and formal methods. 

ETAPS 2021 was originally supposed to take place in Luxembourg City, Luxem- 
bourg organized by the SnT - Interdisciplinary Centre for Security, Reliability and 
Trust, University of Luxembourg. University of Luxembourg was founded in 2003. 
The university is one of the best and most international young universities with 6,700 
students from 129 countries and 1,331 academics from all over the globe. The local 
organisation team consisted of Peter Y.A. Ryan (general chair), Peter B. Roenne (or- 
ganisation chair), Joaquin Garcia-Alfaro (workshop chair), Magali Martin (event 
manager), David Mestel (publicity chair), and Alfredo Rial (local proceedings chair). 

ETAPS 2021 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and representatives
of the individual ETAPS conferences, as well as representatives of EATCS,
EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken),
Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König
(Duisburg), Gerald Lüttgen (Bamberg), Caterina Urban (INRIA), Tarmo Uustalu
(Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the steering committee are: Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Jan-Friso Groote (Eindhoven), Esther 
Guerra (Madrid), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), 
Stefan Kiefer (Oxford), Fabrice Kordon (Paris), Jan Křetínský (Munich), Kim G. 
Larsen (Aalborg), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Grigore 
Rosu (Illinois), Peter Ryan (Luxembourg), Don Sannella (Edinburgh), Lutz Schröder 
(Erlangen), Ilya Sergey (Singapore), Mariëlle Stoelinga (Twente), Gabriele Taentzer
(Marburg), Christine Tasson (Paris), Peter Thiemann (Freiburg), Jan Vitek (Prague), 
Anton Wijs (Eindhoven), Manuel Wimmer (Linz), and Nobuko Yoshida (London). 

I'd like to take this opportunity to thank all the authors, attendees, organizers of the
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2021. 

Finally, a big thanks to Peter, Peter, Magali and their local organisation team for all 
their enormous efforts to make ETAPS a fantastic online event. I hope there will be a 
next opportunity to host ETAPS in Luxembourg. 


February 2021 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


This volume contains the papers presented at FASE 2021, the 24th International 
Conference on Fundamental Approaches to Software Engineering. FASE 2021 was 
organized as part of the annual European Joint Conferences on Theory and Practice of 
Software (ETAPS 2021). 

FASE is concerned with the foundations on which software engineering is built, 
including topics like software engineering as an engineering discipline, requirements 
engineering, software architectures, software quality, model-driven development, 
software processes, software evolution, search-based software engineering, and the 
specification, design, and implementation of particular classes of systems, such as 
(self-)adaptive, collaborative, intelligent, embedded, distributed, mobile, pervasive, 
cyber-physical, or service-oriented applications. 

FASE 2021 received 51 submissions. The submissions came from the following 
countries (in alphabetical order): Argentina, Australia, Austria, Belgium, Brazil, 
Canada, China, France, Germany, Iceland, India, Ireland, Italy, Luxembourg, Mace- 
donia, Malta, Netherlands, Norway, Russia, Singapore, South Korea, Spain, Sweden, 
Taiwan, United Kingdom, and United States. FASE used a double-blind reviewing 
process. Each submission was reviewed by three Program Committee members. After 
an online discussion period, the Program Committee accepted 16 papers as part of the 
conference program (31% acceptance rate). 

FASE 2021 hosted the 3rd International Competition on Software Testing 
(Test-Comp 2021). Test-Comp is an annual comparative evaluation of testing tools. 
This edition contained 11 participating tools, from academia and industry. These 
proceedings contain the competition report and three system descriptions of partici- 
pating tools. The system-description papers were reviewed and selected by a separate 
program committee: the Test-Comp jury. Each paper was assessed by at least three 
reviewers. Two sessions in the FASE program were reserved for the presentation of the 
results: the summary by the Test-Comp chair and the participating tools by the 
developer teams in the first session, and the community meeting in the second session. 

A lot of people contributed to the success of FASE 2021. We are grateful to the 
Program Committee members and reviewers for their thorough reviews and con- 
structive discussions. We thank the ETAPS 2021 organizers, in particular, 
Peter Y. A. Ryan (General Chair), Joaquin Garcia-Alfaro (Workshops Chair), Peter 
Roenne (Organization Chair), Magali Martin (Event Manager), David Mestel (Publicity 
Chair) and Alfredo Rial (Local Proceedings Chair). We also thank Marieke Huisman 
(Steering Committee Chair of ETAPS 2021) for managing the process, and Gabriele 
Taentzer (Steering Committee Chair of FASE 2021) for her feedback and support. Last
but not least, we would like to thank the authors for their excellent work. 

On Benchmarking for
Concurrent Runtime Verification*


Luca Aceto²,³, Duncan Paul Attard¹,²,
Adrian Francalanza¹, and Anna Ingólfsdóttir²


1 University of Malta, Msida, Malta {duncan.attard.01,afral}@um.edu.mt
2 Reykjavik University, Reykjavik, Iceland {luca,duncanpa17,annai}@ru.is
3 Gran Sasso Science Institute, L’Aquila, Italy {luca.aceto}@gssi.it


Abstract. We present a synthetic benchmarking framework that tar- 
gets the systematic evaluation of RV tools for message-based concurrent 
systems. Our tool can emulate various load profiles via configuration. 
It provides a multi-faceted view of measurements that is conducive to 
a comprehensive assessment of the overhead induced by runtime moni- 
toring. The tool is able to generate significant loads to reveal edge case 
behaviour that may only emerge when the monitoring system is pushed 
to its limit. We evaluate our framework in two ways. First, we conduct 
sanity checks to assess the precision of the measurement mechanisms 
used, the repeatability of the results obtained, and the veracity of the 
behaviour emulated by our synthetic benchmark. We then showcase the 
utility of the features offered by our tool in a two-part RV case study. 


Keywords: Runtime verification · Synthetic benchmarking · Software
performance evaluation · Concurrent systems


1 Introduction 


Large-scale software design has shifted from the classic monolithic architecture 
to one where applications are structured in terms of independently-executing 
asynchronous components [17]. This shift poses new challenges to the validation 
of such systems. Runtime Verification (RV) [9,27] is a post-deployment technique 
that is used to complement other methods such as testing [46] to assess the func- 
tional (e.g. correctness) and non-functional (e.g. quality of service) aspects of 
concurrent software. RV relies on instrumenting the system to be analysed with 
monitors, which inevitably introduce runtime overhead that should be kept min- 
imal [9]. While the worst-case complexity bounds for monitor-induced overheads 
can be calculated via standard methods (see, e.g. [40,14,1,28]), benchmarking is, 
by far, the preferred method for assessing these overheads [9,27]. One reason for
this choice is that benchmarks tend to be more representative of the overhead
observed in practice [30,15]. Benchmarks also provide a common platform for
gauging workloads, making it possible to compare different RV tool implementations,
or rerun experiments to reproduce and confirm existing results.


* Supported by the doctoral student grant (No: 207055-051) and the TheoFoMon
project (No: 163406-051) under the Icelandic Research Fund, the BehAPI project
funded by the EU H2020 RISE under the Marie Skłodowska-Curie action
(No: 778233), the ENDEAVOUR Scholarship Scheme (Group B, national funds),
and the MIUR project PRIN 2017FTXR7S IT MATTERS.

© The Author(s) 2021


The utility of a benchmarking tool typically rests on two aspects: (i) the 
coverage of scenarios of interest, and (ii) the quality of runtime metrics col- 
lected by the benchmark harness. To represent scenarios of interest, benchmark- 
ing tools generally employ suites of third-party off-the-shelf (OTS) programs 
(e.g. [60,11,59]). OTS software is appealing because it is readily usable and in- 
herently provides realistic scenarios. By and large, benchmarks rely on a range of 
OTS programs to broaden the coverage of real-world scenarios (e.g. DaCapo [11] 
uses 11 open-source libraries). Yet, using OTS programs as benchmarks poses 
challenges. By design, these programs do not expose hooks that enable harnesses 
to easily and accurately gather the runtime metrics of interest. When OTS soft- 
ware is treated as a black box, benchmarks become harder to control, impacting 
their ability to produce repeatable results. OTS software-based benchmarks are 
also limited when inducing specific edge cases—this aspect is critical when as- 
sessing the safety of software, such as runtime monitors, that are often assumed 
to be dependable. Custom-built synthetic programs (e.g. [35]) are an alternative 
way to perform benchmarking. These tend to be less popular due to the per- 
ceived drawbacks associated with developing such programs from scratch, and 
the lack of ‘real-world’ behaviour intrinsic to benchmarks based on OTS soft- 
ware. However, synthetic benchmarks offer benefits that offset these drawbacks. 
For example, specialised hooks can be built into the synthetic set-up to collect 
a broad range of runtime metrics. Moreover, synthetic benchmarks can also be 
parametrised to emulate variations on the same core benchmark behaviour; this 
is usually harder to achieve via OTS programs that implement narrow use cases. 


Established benchmarking tools such as SPECjvm2008 [60], DaCapo [11], 
ScalaBench [59] and Savina [35], all developed for the JVM, feature extensively in
the RV literature, e.g. see [48,19,18,54,13,45]. Apart from [45], these works assess 
the runtime overhead solely in terms of the execution slowdown, i.e., the differ- 
ence in running time between the system fitted with and without monitors. Re- 
cently, the International RV competition (CRV) [8] advocated for other metrics, 
such as memory consumption, to give a more qualitative view of runtime over- 
head. We hold that RV set-ups that target concurrency benefit from other facets 
of runtime behaviour, such as the response time, that captures the overhead be- 
tween communicating components. Tangibly, this metric reflects the perceived 
reactiveness from an end-user standpoint (e.g. interactive apps) [50,61,58,21]; 
more generally, it describes the service degradation that must be accounted for 
to ensure adequate quality of service [15,39]. Arguably, benchmarking tools like 
the ones above (e.g. Savina) should provide even more. Often, RV set-ups for 
concurrent systems need to scale in response to dynamic changes, and the capac- 
ity for a benchmark to emulate high loads cannot be overstated. In actual fact, 
these loads are known to assume characteristic profiles (e.g. spikes or uniform 
rates), which are hard to administer with the benchmarks mentioned earlier. 

The state of the art in benchmarking for concurrent RV suffers from another
issue. Existing benchmarks, conceived for validating other tools, are repurposed
for RV and often fail to cater for concurrent scenarios where RV is
realistically put to use. SPECjvm2008, DaCapo, and ScalaBench lack workloads 
that leverage the JVM concurrency primitives [52]; meanwhile, [12] shows that 
the Savina microbenchmarks are essentially sequential, and that the rest of the 
programs in the suite are sufficiently simple to be regarded as microbenchmarks 
too. The CRV suite mostly targets monolithic software with limited concurrency, 
where the potential for scaling up to high loads is, therefore, severely curbed. 

This paper presents a benchmarking framework for evaluating runtime mon- 
itoring tools written for verification purposes. Our tool focusses on component 
systems for asynchronous message-passing concurrency. It generates synthetic 
system models following the master-slave architecture [61]. The master-slave ar- 
chitecture is pervasive in distributed (e.g. DNS, IoT) and concurrent (e.g. web 
servers, thread pools) systems [61,29], and lies at the core of the MapReduce 
model [22] supported by Big Data frameworks such as Hadoop [63]. This justi- 
fies our aim to build a benchmarking tool targeting this architecture. Concretely: 


— We detail the design of a configurable benchmark that emulates various 
master-slave models under commonly-observed load profiles, and gathers dif- 
ferent metrics that give a multi-faceted view of runtime overhead, Sec. 2. 

— We demonstrate that our synthetic benchmarks can be engineered to ap- 
proximate the realistic behaviour of web server traffic with high degrees of 
precision and repeatability, Sec. 3.1. 

— We present a case study that (i) shows how the load profiles and parametris- 
ability of our benchmarks can produce edge cases that can be measured 
through our performance metrics to assess runtime monitoring tools in a
comprehensive manner, and (ii) confirms that the results from (i) coincide 
with those obtained via a real-world use case using OTS software, Sec. 3.2. 


2 Benchmark Design and Implementation 


Our set-up can emulate a range of system models and subject them to various 
load types. We consider master-slave architectures, where one central process, 
called the master, creates and allocates tasks to slave processes [61]. Slaves 
work concurrently on tasks, relaying the result to the master when ready; the 
latter then combines these results to yield the final output. Our slaves are an 
abstraction of sets of cooperating processes that can be treated as a single unit. 


2.1 Approach 


We target concurrent applications that execute on a single node. Nevertheless, 
our design adheres to three criteria that facilitate its extension to a distributed 
setting. Specifically, components: (i) share neither a common clock, (ii) nor 
memory, and (iii) communicate via asynchronous messages. Our present set-up 
assumes that communication is reliable and components do not fail. 

Load generation. Load on the system is induced by the master when it creates
slave processes and allocates tasks. The total number of slaves in one run can be 
set via the parameter n. Tasks are allocated to slave processes by the master, 
and consist of one or more work requests that a slave receives, handles, and relays 
back. A slave terminates its execution when all of its allocated work requests have 
been processed and acknowledged by the master. The number of work requests 
that can be batched in a task is controlled by the parameter w; the actual batch 
size per slave is then drawn randomly from a normal distribution with mean
$\mu = w$ and standard deviation $\sigma = \mu \times 0.02$. This induces a degree of variability in
the amount of work requests exchanged between master and slaves. The master 
and slaves communicate asynchronously: an allocated work request is delivered 
to a slave process’ incoming work queue where it is eventually handled. Work 
responses issued by a slave are queued and processed similarly on the master. 
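
The batch-size draw described above can be sketched in a few lines of Python. This is only an illustration, not the authors' Erlang implementation: the function name `draw_batch_sizes`, the rounding to whole requests, and the lower bound of one request per slave are assumptions made for the sketch, as is reading the standard deviation as 2% of the mean w.

```python
import random


def draw_batch_sizes(n, w, seed=None):
    """Draw one batch size per slave from a normal distribution.

    n is the number of slaves and w the mean number of work requests
    per task. The standard deviation is taken to be 2% of the mean.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    mu = w
    sigma = mu * 0.02
    # Rounding and the floor of 1 are assumptions of this sketch: a task
    # must contain a whole, positive number of work requests.
    return [max(1, round(rng.gauss(mu, sigma))) for _ in range(n)]
```

Calling `draw_batch_sizes(1000, 100, seed=1)` returns 1000 batch sizes clustered tightly around 100, and repeating the call with the same seed returns the same list.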


Load configuration. We consider three load profiles (see fig. 3 for examples) that 
determine how the creation of slaves is distributed along the load timeline t. 
The timeline is modelled as a sequence of discrete logical time units representing 
instants at which a new set of slaves is created by the master. Steady loads 
replicate executions where a system operates under stable conditions. These are 
modelled on a homogeneous Poisson distribution with rate $\lambda$, specifying the mean
number of slaves that are created at each time instant along the load timeline
with duration $t = \lceil n/\lambda \rceil$. Pulse loads emulate settings where a system experiences
gradually increasing load peaks. The Pulse load shape is parametrised by t and
the spread, s, that controls how slowly or sharply the system load increases as it
approaches its maximum peak, halfway along t. Pulses are modelled on a normal
distribution with $\mu = t/2$ and $\sigma = s$. Burst loads capture scenarios where a system
is stressed due to load spikes; these are based on a log-normal distribution with
$\mu = \ln(m^2/\sqrt{p^2 + m^2})$ and $\sigma = \sqrt{\ln(1 + p^2/m^2)}$, where $m = t/2$, and parameter p
is the pinch controlling the concentration of the initial load burst.
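
The three load profiles can be approximated with the Python standard library as follows. This is a hedged sketch rather than the tool's actual code: all function names are hypothetical, out-of-range draws are clamped to the first or last time unit (an assumption), and the Steady profile is generated by placing n arrivals uniformly on the timeline, which relies on the standard property that the arrival times of a homogeneous Poisson process, conditioned on their total number, are uniformly distributed.

```python
import math
import random


def _histogram(times, t):
    """Count how many creation times fall in each of the t time units."""
    profile = [0] * t
    for x in times:
        slot = min(max(int(x), 0), t - 1)  # clamp stray draws (assumption)
        profile[slot] += 1
    return profile


def steady_duration(n, lam):
    """Timeline length for a Steady load: the ceiling of n / lambda."""
    return math.ceil(n / lam)


def steady(n, t, rng):
    # Given n arrivals in [0, t), Poisson arrival times are uniform.
    return _histogram((rng.uniform(0, t) for _ in range(n)), t)


def pulse(n, t, s, rng):
    # Normal distribution centred halfway along the timeline, spread s.
    return _histogram((rng.gauss(t / 2, s) for _ in range(n)), t)


def burst(n, t, p, rng):
    # Log-normal distribution parametrised by m = t/2 and the pinch p.
    m = t / 2
    mu = math.log(m * m / math.sqrt(p * p + m * m))
    sigma = math.sqrt(math.log(1 + (p * p) / (m * m)))
    return _histogram((rng.lognormvariate(mu, sigma) for _ in range(n)), t)
```

Each function returns an array with t elements whose values sum to n, which matches the shape of the load profile that the master later consumes one element at a time.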


Wall-clock time. A load profile created for a logical timeline t is put into effect 
by the master process when the system starts running. The master does not 
create the slave processes that are set to execute in a particular time unit in one 
go, since this naïve strategy risks saturating the system, deceivingly increasing 
the load. In doing so, the system may become overloaded not because the mean 
request rate is high, but because the created slaves overwhelm the master when 
they send their requests all at once. We address this issue by introducing the 
notion of concrete time that maps one discrete time unit in t to a real time period,
$\pi$. The parameter $\pi$ is given in milliseconds (ms), and defaults to 1000 ms.


Slave scheduling. The master process employs a scheduling scheme to distribute
the creation of slaves uniformly across the time period $\pi$. It makes use of three
queues: the Order queue, Ready queue, and Await queue, denoted by Qo, Qr,
and Qa respectively. Qo is initially populated with the load profile (fig. 1a).
The load profile consists of an array with t elements, each corresponding
to a discrete time instant in t, where the value $l$ of every element indicates the
number of slaves to be created at that instant. Slaves, $S_1, S_2, \ldots, S_n$, are scheduled
and created in rounds, as follows. The master picks the first element from Qo
to compute the upcoming schedule, which starts at the current time,
$c$, and finishes at $c + \pi$. A series of $l$ time points, $p_1, p_2, \ldots, p_l$, in the schedule
period $\pi$ are cumulatively calculated by drawing the next $p_i$ from a normal
distribution with $\mu = \lceil \pi/l \rceil$ and $\sigma = \mu \times 0.1$. Each time point stipulates a moment
in wall-clock time when a new slave $S_i$ is to be created; this set of time points
is monotonic, and constitutes the Ready queue, Qr. The master checks
Qr (fig. 1b) and creates the slaves whose time point $p_i$ is smaller
than or equal to the current wall-clock time⁴. The
time point $p_i$ of a newly-created slave is removed from Qr, and an entry for
the corresponding slave $S_i$ is appended to the Await queue Qa; this is shown
in fig. 1 for $S_1$ and $S_2$. Slaves in Qa are now ready to receive work requests
from the master process. Qa is traversed by the master at this
stage so that work requests can be allocated to existing slaves. The master
continues processing queue Qr in subsequent rounds, creating slaves, issuing
work requests, and updating Qr and Qa accordingly, as shown
in fig. 1c. At any point, the master can receive responses (e.g. fig. 1d);
these are buffered inside the master’s incoming work queue and handled once
the scheduling and work allocation phases are complete. A fresh batch of slaves
from Qo is scheduled by the master whenever Qr becomes empty, and
the described procedure is repeated. The master stops scheduling slaves when all
the entries in Qo are processed. It then transitions to work-only mode, where it
continues allocating work requests and handling incoming responses from slaves.

⁴ We assume that the platform scheduling the master and slave processes is fair.

Fig. 1: Master M scheduling slave processes $S_i$ and allocating work requests
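
The computation of one schedule round can be sketched as below. This is a simplified illustration under stated assumptions, not the tool's Erlang code: the names `ready_queue` and `due_slaves` are hypothetical, times are plain numbers in milliseconds, the mean step is taken as the real period divided by the number of slaves (rounded up), and negative steps are clamped to zero so that the time points stay monotonic.

```python
import math
import random


def ready_queue(c, period_ms, l, rng):
    """Return l monotonic wall-clock time points that start after time c.

    c is the current time in ms, period_ms the real duration of one
    logical time unit, and l the number of slaves due in that unit.
    """
    if l == 0:
        return []
    mu = math.ceil(period_ms / l)
    sigma = mu * 0.1
    points, current = [], c
    for _ in range(l):
        # Steps are accumulated; clamping at 0 is an assumption that
        # guarantees a non-decreasing sequence.
        current += max(0.0, rng.gauss(mu, sigma))
        points.append(current)
    return points


def due_slaves(qr, now):
    """Split the Ready queue into due (<= now) and pending time points."""
    due = [p for p in qr if p <= now]
    pending = [p for p in qr if p > now]
    return due, pending
```

In a full scheduler, every time point returned in `due` would trigger the creation of one slave and the addition of an entry to the Await queue, while `pending` becomes the new Ready queue for the next round.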


Reactiveness and task allocation. Systems generally respond to load with dif- 
fering rates, due to the computational complexity of the task at hand, IO, or 
slowdown when the system itself becomes gradually loaded. We simulate these 
phenomena using the parameters Pr(send) and Pr(recv). The master interleaves 
the processing of work requests to allocate them uniformly among the various 
slaves: Pr(send) and Pr(recv) bias this behaviour. Specifically, Pr(send) con- 
trols the probability that a work request is sent by the master to a slave, whereas 
Pr(recv) determines the probability that a work response received by the master 
is processed. Sending and receiving is turn-based and modelled on a Bernoulli 
trial. The master picks a slave S; from Qa, and sends at least one work request 
when X < Pr(send), i.e., the Bernoulli trial succeeds; X is drawn from a uni- 
form distribution on the interval [0,1]. Further requests to the same slave are 
allocated following this scheme (see fig. 1) and the entry for
S; in Qa is updated accordingly with the number of work requests remaining. 
When X > Pr(send), i.e., the Bernoulli trial fails, the slave misses its turn, and 
the next slave in Qa is picked. The master also queries its incoming work queue 
to determine whether a response can be processed. It dequeues one response 
when X < Pr(recv), and the attempt is repeated for the next response in the 
queue until X > Pr(recv). The master signals slaves to terminate once it acknowledges
all of their work responses (see fig. 1). Due to the load imbalance
that may occur when the master becomes overloaded with work responses relayed
by slaves, dequeuing is repeated |Qa| times. This encourages an even load
distribution in the system as the number of slaves fluctuates at runtime. 
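
The turn-based sending and receiving can be illustrated with a small Python sketch. It is a simplification under assumptions: the names `allocate_round` and `dequeue_responses` are hypothetical, the Await queue is modelled as a dictionary from slave identifier to the number of outstanding work requests, and the repetition of dequeuing |Qa| times is left out.

```python
import random


def allocate_round(qa, pr_send, rng):
    """Make one pass over the Await queue and return (slave, count) pairs.

    qa maps a slave id to the number of work requests it is still owed;
    it is updated in place as requests are sent.
    """
    sent = []
    for slave in list(qa):
        count = 0
        # Keep sending to the same slave while the Bernoulli trial
        # succeeds; the first failure passes the turn to the next slave.
        while qa[slave] > 0 and rng.random() < pr_send:
            qa[slave] -= 1
            count += 1
        if count:
            sent.append((slave, count))
    return sent


def dequeue_responses(mailbox, pr_recv, rng):
    """Dequeue responses from the front of the mailbox until a trial fails."""
    handled = []
    while mailbox and rng.random() < pr_recv:
        handled.append(mailbox.pop(0))
    return handled
```

With `pr_send` set to 1 every slave receives its whole batch in a single turn, so allocation becomes sequential, and with `pr_send` set to 0 nothing is ever sent; intermediate values interleave the allocation across slaves.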


2.2 Realisability 


The set-up detailed in sec. 2.1 is easily translatable to the actor model of compu- 
tation [2]. In this model, the basic units of decomposition are actors: concurrent 
entities that do not share mutable memory with other actors. Instead, they in- 
teract via asynchronous messaging. Each actor owns an incoming message buffer 
called the mailbox. Besides sending and receiving messages, an actor can also fork 
other child actors. Actors are uniquely addressable via a dynamically-assigned 
identifier, often referred to as the PID. Actor frameworks such as Erlang [16], 
Akka [55] for Scala [51], and Thespian [53] for Python [44] implement actors as 
lightweight processes to enable highly-scalable architectures that span multiple 
machines. The terms actor and process are used interchangeably henceforth. 


Implementation. We use Erlang to implement the set-up of sec. 2.1. Our im- 
plementation maps the master and slave processes to actors, where slaves are 
forked by the master via the Erlang function spawn(); in Akka and Thespian
ActorContext.spawn() and Actor.createActor() can be respectively used to 
the same effect. The work request queues for both master and slave processes co- 
incide with actor mailboxes. We abstract the task computation and model work 
requests as Erlang messages. Slaves emulate no delay, but respond instantly to 
work requests once these have been processed; delay in the system can be in- 
duced via parameters Pr(send) and Pr(recv). To maximise efficiency, the Order, 
Ready and Await queues used by our scheduling scheme are maintained locally 
within the master. The master process keeps track of other details, such as the 
total number of work requests sent and received, to determine when the system 
should stop executing. We extend the parameters in sec. 2.1 with a seed parame- 
ter, r, to fix the Erlang pseudorandom number generator to output reproducible 
number sequences. 
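
To make the mapping from master and slave processes to actors with mailboxes more concrete, the following Python sketch mimics it with threads and queues. It is only an analogy: Python threads are OS threads rather than lightweight Erlang processes, and the names `slave` and `run_master`, the tuple message format, and the `"stop"` signal are all assumptions of this sketch.

```python
import queue
import threading


def slave(mailbox, master_mailbox):
    """Echo every work request back to the master until told to stop."""
    while True:
        msg = mailbox.get()  # blocks until a message arrives
        if msg == "stop":
            return
        master_mailbox.put(("response", msg))


def run_master(n_slaves, w):
    """Fork n_slaves slaves, send each w requests, collect all responses."""
    master_mailbox = queue.Queue()
    slaves = []
    for _ in range(n_slaves):
        mailbox = queue.Queue()  # each actor owns its own mailbox
        thread = threading.Thread(target=slave, args=(mailbox, master_mailbox))
        thread.start()
        slaves.append((mailbox, thread))
    for sid, (mailbox, _) in enumerate(slaves):
        for req in range(w):
            mailbox.put((sid, req))  # asynchronous send: does not wait
    responses = [master_mailbox.get() for _ in range(n_slaves * w)]
    for mailbox, thread in slaves:
        mailbox.put("stop")  # signal termination once all work is done
        thread.join()
    return responses
```

As in the set-up described above, slaves add no artificial delay: they respond as soon as a request is taken from the mailbox, and the master decides when the system stops.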


2.3 Measurement Collection 


To give a multi-faceted view of runtime overhead, we extend the approach in [8] 
and, apart from the (i) mean execution duration, measured in seconds (s), we also 
collect the (ii) mean scheduler utilisation, as a percentage of the total available 
capacity, (iii) mean memory consumption, measured in GB, and, (iv) mean 
response time (RT), measured in milliseconds (ms). Our definition of runtime 
overhead encompasses all four metrics. Measurement taking largely depends on 
the platform on which the benchmark executes, and one often leverages platform- 
specific optimised functionality in order to attain high levels of efficiency. Our 
implementation relies on the functionality provided by the Erlang ecosystem. 


Sampling. We collect measurements centrally using a special process, called 
the Collector, that samples the runtime to obtain periodic snapshots of the 
execution environment (see fig. 2). Sampling is often necessary to induce low 
overhead in the system, especially in scenarios where the system components 
are sensitive to latency [32]. Our sampling frequency is set to 500 ms: this figure 
was determined empirically, whereby the measurements gathered are neither too 
coarse, nor excessively fine-grained such that sampling affects the runtime. Every 
sampling snapshot combines the four metrics mentioned above and formats them 
as records that are written asynchronously to disk to minimise IO delays. 

Performance metrics. Memory and scheduler readings are gathered via the Erlang
Virtual Machine (EVM). We sample scheduler utilisation, rather than CPU utilisation
at the OS level, since the EVM keeps scheduler threads momentarily spinning
to remain reactive; this would inflate the OS-level reading. The overall system
responsiveness is captured by the mean RT metric. Our Collector exposes a hook
that the master uses to obtain unique timestamps (see fig. 2). These are embedded
in all work request messages the master issues to slaves. Each timestamp
enables the Collector to track the time taken for a message to travel from the
master to a slave and back, including the time it spends in the master’s mailbox
until dequeued, i.e., the round-trip shown in fig. 2. To efficiently compute the
RT, the Collector samples the total number of messages exchanged between the
master and slaves, and calculates the mean using Welford’s online algorithm [62].

Fig. 2: Collector tracking the round-trip time for work requests and responses
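
The appeal of Welford's online algorithm is that the mean is updated one sample at a time, without storing the individual response times. A minimal Python sketch follows; the class name `RunningStats` and its method names are hypothetical, and the variance accumulator is included only to show the complete algorithm, since the text above mentions the mean alone.

```python
class RunningStats:
    """Running mean and variance computed with Welford's online update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, rt_ms):
        """Fold one response-time sample (in ms) into the statistics."""
        self.count += 1
        delta = rt_ms - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (rt_ms - self.mean)
        return self.mean

    def variance(self):
        """Sample variance; defined once at least two samples are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```

Feeding the samples 10, 20, 30 and 40 one by one leaves the mean at 25 after the fourth update, using constant memory regardless of how many messages are sampled.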


3 Evaluation 


We evaluate our synthetic benchmarking tool developed as described in Sec. 2 
in a number of ways. In sec. 3.1, we discuss sanity checks for its measurement 
collection mechanisms, and assess the repeatability of the results obtained from 
the synthetic system executions. Crucially, sec. 3.1 provides evidence that the 
benchmarking tool is sufficiently expressive to cover a number of execution pro- 
files that are shown to emulate realistic scenarios. Sec. 3.2 demonstrates the 
utility of the features offered by our tool for the purposes of assessing RV tools. 


Experiment set-up. We define an experiment to consist of ten benchmarks, each 
performed by running the system set-up with incremental loads. Our experiments 
were performed on an Intel Core i7 M620 64-bit machine with 8GB of memory, 
running Ubuntu 18.04 LTS and Erlang/OTP 22.2.1. 


3.1 Benchmark Expressiveness and Veracity 


The parameters for the tool detailed in sec. 2.1 can be configured to model 
a range of master-slave scenarios. However, not all of these configurations are 
meaningful in practice. For example, setting Pr(send) =0 does not enable the 
master to allocate work requests to slaves; with Pr(send) = 1, this allocation is 
enacted sequentially, defeating the purpose of a concurrent master-slave system. 
In this section, we establish a set of parameter values that model experiment set- 
ups whose behaviour approximates that of master-slave systems typically found 
in practice. Our experiments are conducted with n=500k slaves and w=100 work 
requests per slave. This generates $\approx n \times w \times 2$ (work requests and responses) $= 100$M
message exchanges between the master and slaves. We initially fix Pr(send) = 
Pr(recv) =0.9, and choose a Steady (i.e., Poisson process) load profile since this 
features in industry-strength load testing tools such as Tsung [49] and JMeter [3]. 
Fig. 3 shows the load applied at each benchmark run, e.g. on the tenth run, the 
benchmark uses ~ 5k slaves/s. The total loading time is set to t= 100s. 


Measurement precision. A series of trials were conducted to select the appro- 
priate sampling window size for the RT. This step is crucial because it directly 
affects the capability of the benchmark to scale in terms of its number of slave 
processes and work requests. Our RT sampling of sec. 2.3 (see also fig. 2) was 
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calibrated by taking various window sizes over numerous runs for different load 
profiles of ~ 1M slaves. The results were compared to the actual mean calcu- 
lated on all work request and response messages exchanged between master and 
slaves. Window sizes close to 10% yielded the best results (≈ ±1.4% discrepancy 
from the actual RT). Smaller window sizes produced excessive discrepancy; 
larger sizes induced noticeably higher system loads. We also cross-checked the 
precision of our sampling method of the scheduler utilisation against readings 
obtained via the Erlang Observer tool [16] to confirm that these coincide. 


Experiment repeatability. Data variability affects the repeatability of experi- 
ments. It also plays a role when determining the number of repeated readings, k, 
required before the data measured is deemed sufficiently representative. Choosing 
the lowest k is crucial when experiment runs are time consuming. The coefficient 
of variation (CV), i.e., the ratio of the standard deviation to the mean, 
CV = (σ / x̄) × 100, can be used to establish the value of k empirically, as follows. 
Initially, the CV_k for one batch of experiments for some number of repetitions k 
is calculated. The result is then compared to the CV_k′ for the next batch of 
repetitions k′ = k + b, where b is the step size. When the difference between the 
successive CV metrics CV_k′ and CV_k is sufficiently small (for some percentage ε), 
the value of k is chosen; otherwise the described procedure is repeated with k′. 
Crucially, this 
condition must hold for all variables measured in the experiment before k can 
be fixed. For the results presented next, the CV values were calculated manually. 
The mechanism that determines the CV automatically is left for future work. 
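The procedure for choosing k can be sketched as follows. This is a minimal illustration in Python, not the mechanism left for future work: the names `cv` and `choose_k`, the default step and threshold, the use of the sample standard deviation, and the choice to reuse earlier readings when growing a batch are all assumptions.

```python
import random
import statistics

def cv(readings):
    """Coefficient of variation as a percentage: (std dev / mean) * 100."""
    return statistics.stdev(readings) / statistics.mean(readings) * 100

def choose_k(run_once, k=3, b=3, eps=0.5, k_max=30):
    """Return the smallest tried k whose CV is stable for every metric.

    `run_once` performs one experiment repetition and returns a dict
    mapping metric names to readings. Returns None if no k <= k_max - b
    satisfies the condition.
    """
    readings = [run_once() for _ in range(k)]
    while k + b <= k_max:
        readings += [run_once() for _ in range(b)]   # extend to k' = k + b

        def cv_of(metric, upto):
            return cv([r[metric] for r in readings[:upto]])

        # The condition must hold for all measured variables.
        if all(abs(cv_of(m, k + b) - cv_of(m, k)) < eps for m in readings[0]):
            return k
        k += b
    return None

# Usage with a made-up experiment that reports two noisy metrics.
rng = random.Random(42)
def fake_run():
    return {"rt_ms": rng.gauss(100, 1), "memory_mb": rng.gauss(512, 2)}

k = choose_k(fake_run)
```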


Data variability. The data variability between experiments can be reduced by 
seeding the Erlang pseudorandom number generator (parameter r in sec. 2.2) 
with a constant value. This, in turn, tends to require fewer repeated runs be- 
fore the metrics of interest—scheduler utilisation, memory consumption, RT, 
and execution duration—converge to an acceptable CV. We conduct experiment 
sets with three, six and nine repetitions. For the majority of cases, the CV for 
our metrics is lower when a fixed seed is used, by comparison to its unseeded 
counterpart. In fact, very low CV values for the scheduler utilisation, memory 
consumption, RT, and execution duration, 0.17%, 0.15%, 0.52% and 0.47% 
respectively, were obtained with three repeated runs. We thus set the number of 
repetitions to three for all experiment runs in the sequel. Note that fixing the 
seed still permits the system to exhibit a modicum of variability that stems from 
the inherent interleaved execution of components due to process scheduling. 


Load profiles. Our tool is expressive enough to generate the load profiles intro- 
duced in sec. 2.1 (see fig. 3), enabling us to gauge the behaviour of monitoring 
set-ups under varying forms of loads. These loads make it possible to mock spe- 
cific system scenarios that test different implementation aspects. For example, a 
benchmark configured with load surges could uncover buffer overflows in a par- 
ticular monitoring implementation that only arise under stress when the length 
of the request queue exceeds some preset length. 


System reactivity. The reactivity of the master-slave system correlates with the 
idle time of each slave which, in turn, affects the capacity of the system to absorb 

Fig. 3: Steady, Pulse and Burst load distributions of 500k slaves for 100s 


overheads. Since this can skew the results obtained when assessing overheads, it is 
imperative that the benchmarking tool provides methods to control this aspect. 
The parameters Pr(send) and Pr(recv) regulate the speed with which the system 
reacts to load. We study how these parameters affect the overall performance of 
system models set up with Pr(send) = Pr(recv) ∈ {0.1,0.5,0.9}. The results are 
shown in fig. 4, where each metric (e.g. memory consumption) is plotted against 
the total number of slaves. At Pr(send)=Pr(recv)=0.1, the system has the lowest 
RT out of the three configurations (bottom left), as indicated by the gentle linear 
increase of the plot. One may expect the RT to be lower for the system models 
configured with probability values of 0.5 and 0.9. However, we recall that with 
Pr(send) =0.1, work requests are allocated infrequently by the master, so that 
slaves are often idle, and can readily respond to (low numbers of) incoming work 
requests. At the same time, this prolongs the execution duration, when compared 
to that of the system set with Pr(send) = Pr(recv) ∈ {0.5,0.9} (bottom right). 
This effect of slave idling can be gleaned from the relatively lower scheduler 
utilisation as well (top left). Idling increases memory consumption (top right), 
since slaves created by the master typically remain alive for extended periods. 
By contrast, the plots set with Pr(send) = Pr(recv) ∈ {0.5,0.9} exhibit markedly 
gentler gradients in the memory consumption and execution duration charts; 
corresponding linear slopes can be observed in the RT chart. This indicates that 
values between 0.5 and 0.9 yield system models that: (i) consume reasonable 
amounts of memory, (ii) execute in respectable amounts of time, and (iii) main- 
tain tolerable RT. Since master-slave architectures are typically employed in 
settings where high throughput is demanded, choosing values smaller than 0.5 
goes against this principle. In what follows, we opt for Pr(send)=Pr(recv) =0.9. 


Emulation veracity. Our benchmarks can be configured to closely model real- 
istic web server traffic where the request intervals observed at the server are 
known to follow a Poisson process [31,43,37]. The probability distribution of 
the RT of web application requests is generally right-skewed, and approximates 
log-normal [31,20] or Erlang distributions [37]. We conduct three experiments 
using Steady loads fixed with n=10k for Pr(send) = Pr(recv) ∈ {0.1,0.5,0.9} to 

Fig. 4: Performance benchmarks of system models for Pr(send) and Pr(recv) 


establish whether the RT in our system set-ups resembles the aforementioned dis- 
tributions. Our results, summarised in fig. 5, were obtained by estimating the pa- 
rameters for a set of candidate probability distributions (e.g. normal, log-normal, 
gamma, etc.) using maximum likelihood estimation [56] on the RT obtained from 
each experiment. We then performed goodness-of-fit tests on these parametrised 
distributions using the Kolmogorov-Smirnov test, selecting the most appropriate 
RT fit for each of the three experiments. The fitted distributions in fig. 5 indi- 
cate that the RT of our system models follows the findings reported in [31,20,37]. 
This makes a strong case in favour of our benchmarking tool striking a balance 
between the realism of benchmarks based on OTS programs and the controlla- 
bility offered by synthetic benchmarking. Lastly, we point out that fig. 5 matches 
the observations made in fig. 4, which show an increase in the mean RT as the 
system becomes more reactive. This is evident in the histogram peaks that grow 
shorter as Pr(send) =Pr(recv) progresses from 0.1 to 0.9. 
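The fitting step described above can be illustrated with a small, self-contained sketch. It is not the analysis pipeline used for fig. 5: the RT readings are replaced by synthetic log-normal data, only two candidate distributions are tried, and the helper names are assumptions. The parameters are obtained by maximum likelihood (closed-form for both candidates) and the candidates are ranked by their Kolmogorov-Smirnov distance.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def fit_normal(xs):
    """MLE for a normal distribution: mean and population std dev."""
    return NormalDist(fmean(xs), pstdev(xs))

def fit_lognormal(xs):
    """MLE for a log-normal: fit a normal distribution to log(x)."""
    logs = [math.log(x) for x in xs]
    return NormalDist(fmean(logs), pstdev(logs))

def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF of xs and the given cdf."""
    xs = sorted(xs)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

# Synthetic stand-in for the RT readings of one experiment (assumption).
rng = random.Random(1)
rts = [rng.lognormvariate(3.0, 0.5) for _ in range(5_000)]

normal = fit_normal(rts)
lognormal = fit_lognormal(rts)
d_normal = ks_statistic(rts, normal.cdf)
d_lognormal = ks_statistic(rts, lambda x: lognormal.cdf(math.log(x)))

# The candidate with the smallest distance is selected as the best fit.
best = min([("normal", d_normal), ("log-normal", d_lognormal)],
           key=lambda pair: pair[1])
```

Because the parameters are estimated from the same sample, the sketch only compares distances and does not derive p-values from the standard Kolmogorov-Smirnov tables.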


3.2 Case Study 


We demonstrate how our benchmarking tool can be used to assess the runtime 
overhead comprehensively via a concurrent RV case study. By controlling the 
benchmark parameters and subjecting the system to specific workloads, we show 
that our multi-faceted view of overhead reveals nuances in the observed runtime 
behaviour, benefitting the interpretation of empirical results. We further assess 
the veracity of these synthetic benchmarks against the overhead measured from 
a use case that considers industry-strength OTS applications. 

Fig. 5: Fitted probability distributions on RT for Steady loads for n=10k 


The RV Tool We use an RV tool to objectively compare the conclusions derived 
from our synthetic benchmarks against those obtained from the experiment 
set up with the OTS applications. The tool under scrutiny targets concurrent 
Erlang programs [4]. It synthesises automata-like monitors from sHML specifications 
[26] and inlines them into the system via code injection by manipulating 
the program abstract syntax tree. Inline instrumentation underlies various other 
state-of-the-art RV tools, such as JavaMOP [36], MarQ [54], Java-MaC [38] and 
RiTHM [47]. sHML is a fragment of the Hennessy-Milner Logic with recur- 
sion [41] that can express all regular safety properties [26]. The tool augments 
it to handle pattern matching and data dependencies for three kinds of event 
patterns, namely send and receive actions, denoted by ! and ? respectively, and 
process crash, denoted by x. This suffices to specify properties of both the master 
and slave processes, resulting in the set-up depicted in fig. 6a. For instance, the 
recursive property φs describes an invariant of the master-slave communication 
protocol (from the slave’s point of view), stating that ‘a slave processing integer 
successor requests should not crash’: 

φs = max X.([\Slv x]ff ∧ [\Slv?\Req]([Slv x]ff ∧ [Slv!(Req+1)]X)) 

The key construct in sHML is the modal formula [p]φ, stating that whenever a 
satisfying system exhibits an event e matching pattern p, its continuation then 
satisfies φ. In property φs, the invariant, denoted by the recursion binder max X, 
asserts that a slave Slv does not crash. It further stipulates that when a 
request-carrying payload Req is received, Slv cannot crash, and if the slave 
replies to Req with the payload Req+1, the property recurses on variable X. 
Action patterns use two 
types of value variables: binders, \z, that are pattern-matched to concrete values 
learnt at runtime, and variable instances, z, that are bound by the respective 
binders and instantiated to concrete data via pattern matching at runtime. This 
induces the usual notion of free and bound value variables; we assume closed 
terms. For example, when checking property φs against the trace event pid?42, 
the analysis unfolds the sub-formula guarded by max X, matching the event with 
the pattern \Slv?\Req. Variables Slv and Req are substituted with pid 
and 42 respectively in property φs, leaving the residual formula: 


[pid x]ff ∧ [pid!(42+1)]X 


The RV tool under scrutiny produces inlined monitor code that executes in the 
same process space of system components (see fig. 6a), yielding the lowest pos- 
sible amount of runtime overhead. This enables us to scale our benchmarks to 
considerably high loads. Our experiments focus on correctness properties that 
are parametric w.r.t. system components [7,19,54,48]: with this approach, 
monitors need not interact with one another and can reach verdicts indepen- 
dently. Verdicts are communicated by monitors to a central entity that records 
the expected number of verdicts in order to determine when the experiment can 
be stopped. The set of properties used in our benchmarks translate to monitors 
that loop continually to exert the maximum level of runtime overhead possible. 

Fig. 6: Synthesised monitors instrumented with master and slave processes 

Fig. 6b shows the monitor synthesised from property φs, consisting of states 
Q0, Q1, the rejection state X, and inconclusive state ?. The rejection state 
corresponds to a violation of the property, i.e., ff, whereas the inconclusive state 
is reached when the analysed trace events do not contain enough information 
to enable the monitor to transition to any other state. Both of these states are 
sinks, modelling the irrevocability of verdicts [24,26]. The modality [\Slv?\Req] 
in property φs corresponds to the transition between Q0 and Q1 in fig. 6b. The 
monitor follows this transition when it analyses the trace event pid1?d1 exhibited 
by the slave with PID pid1, when it receives data payload d1 from the master; 
as a side effect, the transition binds the variable Slv to pid1 and Req to d1 in 
state Q1. From Q1, the monitor transitions to Q0 only when the event pid1!d2 
is analysed, where d2 = d1 + 1 and pid1 is the slave PID (previously) bound to 
Slv. From Q0 and Q1, the rejection state X can be reached when a crash event 
is analysed. In the case of Q0, the transition to X is followed for any crash event 
_ x (the wildcard _ denotes the anonymous variable). By contrast, the monitor 
reaches X from Q1 only when the slave with PID pid1 crashes, otherwise it 
transitions to the inconclusive state ?. Other transitions from Q0 and Q1 leading to 
? follow a similar reasoning. Interested readers are encouraged to consult [25,6,5] 
for more information on the specification logic and monitor synthesis. 
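The behaviour of the monitor just described can be mimicked with a small state machine. The sketch below is only an illustration in Python of the transitions listed in the text; it is not the code synthesised by the RV tool, and the event encoding `(kind, pid, payload)` and the state names `"reject"` and `"inconclusive"` are assumptions.

```python
def step(state, env, event):
    """Perform one monitor transition.

    `state` is "Q0", "Q1", "reject" or "inconclusive"; `env` holds the
    bindings of Slv and Req; `event` is (kind, pid, payload) with kind
    being "recv", "send" or "crash" (hypothetical encoding).
    """
    kind, pid, payload = event
    if state in ("reject", "inconclusive"):
        return state, env                    # sinks: verdicts are irrevocable
    if kind == "crash":
        # From Q0 any crash is a violation; from Q1 only a crash of the
        # slave bound to Slv is.
        if state == "Q0" or pid == env["Slv"]:
            return "reject", env
        return "inconclusive", env
    if state == "Q0" and kind == "recv":
        return "Q1", {"Slv": pid, "Req": payload}   # bind Slv and Req
    if (state == "Q1" and kind == "send"
            and pid == env["Slv"] and payload == env["Req"] + 1):
        return "Q0", {}                      # successor reply: recurse
    return "inconclusive", env

def run(trace):
    """Feed a list of events to the monitor and return its final state."""
    state, env = "Q0", {}
    for event in trace:
        state, env = step(state, env, event)
    return state
```

For example, the trace `[("recv", "p1", 42), ("send", "p1", 43)]` brings the monitor back to its initial state, whereas a crash of `p1` after the request leads to the rejection state.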


Synthetic Benchmarks We set the total number of slaves to n= 20k for mod- 
erate loads and n=500k for high loads; Pr(send) = Pr(recv) is fixed at 0.9 as in 
sec. 3.1. These configurations generate ≈ n × w × (work requests and responses) = 
4M and 100M messages respectively to produce 8M and 200M analysable trace 
events per run. The pseudorandom number generator is seeded with a constant 
value and three experiment repetitions are performed for the Steady, Pulse and 
Burst load profiles (see fig. 3). A loading time of t=100s is used. Our results are 
summarised in figs. 7 and 8. Each chart in these figures plots the particular per- 
formance metric (e.g. memory consumption) for the system without monitors, 
i.e., the baseline, together with the overhead induced by the RV monitors. 
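The message and trace event counts quoted above follow from simple arithmetic, reproduced here as a quick check. The helper names are hypothetical, and the factor of two events per message (one send and one receive) is our reading of the reported figures rather than a statement taken from the text.

```python
def message_count(n, w):
    """Each of the n slaves receives w requests and returns w responses."""
    return n * w * 2

def trace_event_count(n, w):
    """Assumes every message is observed twice: as a send and as a receive."""
    return message_count(n, w) * 2

moderate = (message_count(20_000, 100), trace_event_count(20_000, 100))
high = (message_count(500_000, 100), trace_event_count(500_000, 100))
```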


Moderate loads. Fig. 7 shows the plots for the system set with n = 20k. These 
loads are similar to those employed by the state-of-the-art frameworks to evalu- 
ate component-based runtime monitoring, e.g. [57,7,10,23,48] (ours are slightly 
higher). We remark that none of the benchmarks used in these works consider 
different load profiles: they either model load on a Poisson process, or fail to 
specify the kind of load used. In fig. 7, the execution duration chart (bottom 
right) shows that, regardless of the load profile used, the running time of each 
experiment is comparable to the baseline. With the moderate size of 20k slaves, 
the execution duration on its own does not give a detailed enough view of run- 
time overhead, despite the fact that our benchmarks provide a broad coverage in 
terms of the Steady, Pulse and Burst load profiles. This trend is mirrored in the 
scheduler utilisation plot (top left), where both baseline and monitored system 
induce a constant load of ~ 17.5%. On this account, we deem these results to 
be inconclusive. By contrast, our three load profiles induce different overhead 
for the RT (bottom left), and, to a lesser extent, the memory consumption plots 
(top right). Specifically, when the system is subjected to a Burst load, it exhibits 
a surge in the RT for the baseline and monitored system alike, at ~ 16k slaves. 
While this is not reflected in the consumption of memory, the Burst plots do 
exhibit a larger—albeit linear—rate of increase in memory when compared to 
their Steady and Pulse counterparts. The latter two plots once again show anal- 
ogous trends, indicating that both Steady and Pulse loads exact similar memory 
requirements and exhibit comparable responsiveness under the respectable load 
of 20k slaves. Crucially, the data plots in fig. 7 do not enable us to confidently 
extrapolate our results. The edge case in the RT chart for Burst plots raises the 
question of whether the surge in the trend observed at ~ 16k remains consistent 

Fig. 7: Mean runtime overhead for master and slave processes (20k slaves) 


when the number of slaves goes beyond 20k. Similarly, although for a different 
reason, the execution duration plots do not allow us to distinguish between the 
overhead induced by monitors for different loads on this small scale—this occurs 
due to the perturbations introduced by the underlying OS (e.g. scheduling other 
processes, IO, etc.) that affect the sensitive time keeping of benchmarks. 


High loads. We increase the load to n = 500k slaves to determine whether our 
benchmark set-up can adequately scale, and show how the monitored system per- 
forms under stress. The RT chart in fig. 8 indicates that for Burst loads (bottom 
left), the overhead induced by monitors grows linearly in the number of slaves. 
This contradicts the results in fig. 7, confirming our supposition that moderate 
loads may provide scant empirical evidence to extrapolate to general conclu- 
sions. However, the memory consumption for Burst loads (top right) exhibits 
similar trends to the ones in fig. 7. Subjecting the system to high loads renders 
discernible the discrepancy between the RT and memory consumption gradients 
for the Steady and Pulse plots that appeared to be similar under the moderate 
loads of 20k slaves. Considering the execution duration chart (bottom right of 
fig. 8) as the sole indicator of overhead could deceivingly suggest that runtime 
monitoring induces virtually identical overhead for the distinct load profiles of 
fig. 3. However, this erroneous observation is easily refuted by the memory con- 
sumption and RT plots that show otherwise. This stresses the merit of gathering 
multi-faceted metrics to assist in the interpretation of runtime overhead. 

We extend the argument for multi-faceted views to the scheduler utilisation 
metric in fig. 8 that reveals a subtle aspect of our concurrent set-up. Specifically, 

Fig. 8: Mean runtime overhead for master and slave processes (500k slaves) 


the charts show that while the execution duration, RT and memory consumption 
plots grow in the number of slave processes, scheduler utilisation stabilises at ~ 
22.7%. This is partly caused by the master-slave design that becomes susceptible 
to bottlenecks when the master is overloaded with requests [61]. In addition, 
the preemptive scheduling of the EVM [16] ensures that the master shares the 
computational resources of the same machine with the rest of the slaves. We 
conjecture that, in a distributed set-up where the master resides on a dedicated 
node, the overall system throughput may be further pushed. Fig. 8 also attests 
to the utility of having a benchmarking framework that scales considerably well 
to increase the chances of detecting potential trends. For instance, the evidence 
gathered earlier in fig. 7 could have misled one to assert that the RV tool under 
scrutiny scales poorly under Burst loads of moderate and larger sizes. 


An OTS Application Use Case We evaluate the overheads induced by the 
RV tool under scrutiny using a third-party industry-strength web server called 
Cowboy [33], and show that the conclusions we draw are in line with those re- 
ported earlier for our synthetic benchmark results. Cowboy is written in Erlang 
and built on top of Ranch [34]—a socket acceptor pool for TCP protocols that 
can be used to develop custom network applications. Cowboy relies on Ranch 
to manage its socket connections, but delegates HTTP client requests to pro- 
tocol handlers that are forked dynamically by the web server to handle each 
request independently. This architecture follows closely our master-slave set-up 
of sec. 2.1 which abstracts details such as TCP connection management and 

Fig. 9: Mean overhead for synthetic and Cowboy benchmarks (20k threads) 


HTTP protocol parsing. We generate load on Cowboy using the popular stress 
testing tool JMeter [3] to issue HTTP requests from a dedicated machine resid- 
ing on the same network where Cowboy is hosted. The latter machine is the one 
used in the experiments discussed earlier. To emulate the typical behaviour of 
web clients (e.g. browsers) that fetch resources via multiple HTTP requests, our 
Cowboy application serves files of various sizes that are randomly accessed by 
JMeter during the benchmark. In our experiments, we monitored fragments of 
the Cowboy and Ranch communication protocol used to handle client requests. 


Moderate loads. Fig. 9 plots our results for Steady loads from fig. 7, together 
with the ones obtained from the Cowboy benchmarks; JMeter did not enable 
us to reproduce the Pulse and Burst load profiles. For our Cowboy benchmarks, 
we fixed the total number of JMeter request threads to 20k over the span of 
100s, where each thread issued 100 HTTP requests. This configuration coincides 
with parameter settings used in the experiments of fig. 7. In fig. 9, the sched- 
uler utilisation, memory consumption and RT charts (top, bottom left) show 
a correspondence between the baseline plots of our synthetic benchmarks and 
those taken with Cowboy and JMeter. This indicates that, for these metrics, 
our synthetic system model exhibits analogous characteristics to the ones of the 
OTS system, under the chosen load profile. The argument can be extended to 
the monitored versions of these systems which follow identical trends. We point 
out the similarity in the RT trends of our synthetic and Cowboy benchmarks, 
despite the fact that the latter set of experiments was conducted over a local 
network. This suggests that, for our single-machine configuration, the synthetic 
master-slave benchmarks manage to adequately capture local network conditions. 
The gaps separating the plots of the two experiment set-ups stem from the 
implementation specifics of Cowboy and our synthetic model. This discrepancy 
in measurements also depends on the method used to gather runtime metrics, 
e.g. JMeter cannot sample the EVM directly, and measures CPU as opposed to 
scheduler utilisation. The deviation in execution duration plots (bottom right) 
arises for the same reason. 


High loads. Our efforts to run tests with 500k request threads were stymied by 
the scalability issues we experienced with Cowboy and JMeter on our set-up. 


4 Conclusion 


Concurrent RV necessitates benchmarking tools that can scale dynamically to 
accommodate considerable load sizes, and are able to provide a multi-faceted view 
of runtime overhead. This paper presents a benchmarking tool that fulfils these 
requirements. We demonstrate its implementability in Erlang, arguing that the 
design is easily instantiatable to other actor frameworks such as Akka and Thes- 
pian. Our set-up emulates various system models through configurable parame- 
ters, and scales to reveal behaviour that emerges only when software is pushed 
to its limit. The benchmark harness gathers different performance metrics, offering 
a multi-faceted view of runtime overhead that, to our knowledge, other 
state-of-the-art tools do not currently offer. Our experiments demonstrate that these 
metrics benefit the interpretation of empirical measurements: they increase visibility 
that may spare one from drawing insufficiently general, or otherwise erroneous, 
conclusions. We establish that, despite its synthetic nature, our master-slave 
model faithfully approximates the mean response times observed in realistic web 
server traffic. We also compare the results of our synthetic benchmarks against 
those obtained from a real-world use case to confirm that our tool captures the 
behaviour of this realistic set-up. It is worth noting that, while our empirical 
measurements of secs. 3.1 and 3.2 depend on the implementation language, our 
conclusions are transferable to other frameworks, e.g. Akka and Play [42]. 


Related work. There are other less popular benchmarks targeting the JVM be- 
sides those mentioned in sec. 1. Renaissance [52] employs workloads that leverage 
the concurrency primitives of the JVM, focussing on the performance of com- 
piler optimisations similar to DaCapo and ScalaBench. These benchmarks gather 
metrics that measure software quality and complexity, as opposed to metrics that 
gauge runtime overhead. The CRV suite [8] aims to standardise the evaluation 
of RV tools, and mainly focusses on RV for monolithic programs. We are un- 
aware of RV-centric benchmarks for concurrent systems such as ours. In [43], the 
authors propose a queueing model to analyse web server traffic, and develop a 
benchmarking tool to validate it. Their model coincides with our master-slave 
set-up, and considers loads based on a Poisson process. A study of message- 
passing communication on parallel computers conducted in [31] uses systems 
loaded with different numbers of processes; this is similar to our approach. 
Importantly, we were able to confirm the findings reported in [43] and [31] (sec. 3.1). 


Abstract. A program containing placeholders for unspecified statements 
or expressions is called an abstract (or schematic) program. Placeholder 
symbols occur naturally in program transformation rules, as used in 
refactoring, compilation, optimization, or parallelization. We present a 
generalization of automated cost analysis that can handle abstract pro- 
grams and, hence, can analyze the impact on the cost of program trans- 
formations. This kind of relational property requires provably precise 
cost bounds which are not always produced by cost analysis. There- 
fore, we certify by deductive verification that the inferred abstract cost 
bounds are correct and sufficiently precise. It is the first approach solving 
this problem. Both abstract cost analysis and certification are based on 
quantitative abstract execution (QAE) which in turn is a variation of 
abstract execution, a recently developed symbolic execution technique 
for abstract programs. To realize QAE the new concept of a cost invari- 
ant is introduced. QAE is implemented and runs fully automatically on 
a benchmark set consisting of representative optimization rules. 


1 Introduction 


We present a generalization of automated cost analysis that can handle pro- 
grams containing placeholders for unspecified statements. Consider the program 
Q = “i=0; while (i < t) {P; i++;}”, where P is any statement not modifying 
i or t. We call P an abstract statement; a program like Q containing abstract 
statements is called an abstract program. The (exact or upper bound) cost of 
executing P is described by a function ac_P(x̄) depending on the variables x̄ occurring 
in P. We call this function the abstract cost of P. Assuming that executing any 
statement has unit cost and that t > 0, one can compute the (abstract) cost of 
Q as 2 + t · (ac_P(x̄) + 2) depending on ac_P and t. For any concrete instance of P, 
we can derive its concrete cost as usual and then obtain the concrete cost of Q 
simply by instantiating ac_P. In this paper, we define and implement an abstract 
cost analysis to infer abstract cost bounds. Our implementation consists of an 
automatic abstract cost analysis tool and an automatic certifier for the correct- 
ness of inferred abstract bounds. Both steps are performed with an approach 
called Quantitative Abstract Execution (QAE). 
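The bound for Q can be checked by adding up the unit costs along one execution. The breakdown below is one plausible accounting under the stated unit-cost model; which statement each unit is charged to is an assumption made for illustration.

```latex
\begin{align*}
\mathit{cost}(Q)
  &= \underbrace{1}_{\texttt{i=0}}
   + \underbrace{t \cdot \bigl(1 + ac_P(\bar{x}) + 1\bigr)}_{t \text{ iterations: guard, } P,\ \texttt{i++}}
   + \underbrace{1}_{\text{final guard}} \\
  &= 2 + t \cdot \bigl(ac_P(\bar{x}) + 2\bigr)
\end{align*}
```

For instance, instantiating P with a single unit-cost statement (ac_P = 1) and taking t = 10 gives 2 + 10 · 3 = 32.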

Fine, but what is this good for? Abstract programs occur in program transformation 
rules used in compilation, optimization, parallelization, refactoring, 
etc.: Transformations are specified as rules over program schemata which are 
nothing but abstract programs. If we can perform cost analysis of abstract pro- 
grams, we can analyze the cost effect of program transformations. Our approach 
is the first method to analyze the cost impact of program transformations. 


Automated Cost Analysis. Cost analysis occupies an interesting middle ground 
between termination checking and full functional verification in the static pro- 
gram analysis portfolio. The main problem in functional verification is that one 
has to come up with a functional specification of the intended behavior, as well 
as with auxiliary specifications including loop invariants and contracts [21]. In 
contrast, termination is a generic property and it is sufficient to come up with 
a suitable term order or ranking function [6]. For many programs, termination 
analysis is vastly easier to automate than verification.¹ 

Computation cost is not a generic property, but it is usually schematic: One 
fixes a class of cost functions (for example, polynomial) that can be handled. 
A cost analysis then must come up with parameters (degree, coefficients) that 
constitute a valid bound (lower, upper, exact) for all inputs of a given program 
with respect to a cost model (# of instructions, allocated memory, etc.). If this 
is performed bottom up with respect to a program’s call graph, it is possible to 
infer a cost bound for the top-level function of a program. Such a cost expression 
is often symbolic, because it depends on the program’s input parameters. 

A central technique for inferring symbolic cost of a piece of code with high 
precision is symbolic execution (SE) [9,25]. The main difficulty is to render SE 
of loops with symbolic bounds finite. This is achieved with loop invariants that 
generalize the behavior of a loop body: an invariant is valid at the loop head after 
arbitrarily many iterations. To infer sufficiently strong invariants automatically 
is generally an unsolved problem in functional verification, but much easier in the 
context of cost analysis, because invariants do not need to characterize functional 
behavior: it suffices that they permit inferring schematic cost expressions. 


Abstract Execution. Inferring the cost of program transformation schemata 
requires the capability of analyzing abstract programs. This is not possible with 
standard SE, because abstract statements have no operational semantics. One 
way to reason about abstract programs is to perform structural induction over 
the syntactic definition of statements and expressions whenever an abstract sym- 
bol is encountered. Structural induction is done in interactive theorem prov- 
ing [7,31] to verify, e.g., compilers. It is labor-intensive and not automatic. In- 
stead, here we perform cost analysis of abstract programs via a recent generaliza- 
tion of SE called abstract execution (AE) [37,38]. The idea of AE is, quite simply, 
to symbolically execute a program containing abstract placeholder symbols for 
expressions and statements, just as if it were a concrete program. It might seem 
counterintuitive that this is possible: after all, nothing is known about an 
abstract symbol. But this is not quite true: one can equip an abstract symbol with 
an abstract description of the behavior of its instances: a set of memory 
locations its behavior may depend on, commonly called footprint, and a (possibly 
different) set of memory locations it can change, commonly called frame [21]. 


¹ In theory, of course, proving termination is as difficult as functional verification. 
It is hard to imagine, for example, finding a termination argument for the Collatz 
function without a deep understanding of what it does. But automated termination 
checking works very well for many programs in practice. 


Cost Invariants. In automated cost analysis, one often infers cost bounds from 
loop invariants, ranking functions, and size relations computed during SE [3,11, 
16,40]. For abstract programs, we need a more general concept, namely a loop 
invariant expressing a valid abstract cost bound at the beginning of any iteration 
(e.g., 2 + i · (ac_P(x̄) + 2) for the program Q above). We call this a cost invariant. 
This is an important technical innovation of this paper, increasing the modularity 
of cost analysis, because each loop can be verified and certified separately. 
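To illustrate what a cost invariant asserts, the sketch below executes the introductory program Q for a concrete instance of P and checks the bound at every loop head. It is a dynamic check in Python under the unit-cost assumption, not the deductive certification performed by QAE; the function name and the point at which each unit is charged are assumptions.

```python
def check_cost_invariant(t, ac_p):
    """Run Q = "i=0; while (i < t) {P; i++;}" with unit costs.

    `ac_p` is the cost of one execution of P. At every loop head the
    accumulated cost must not exceed 2 + i * (ac_p + 2).
    """
    cost = 1                      # i = 0
    i = 0
    while True:
        # Loop head: the cost invariant must hold before the guard runs.
        assert cost <= 2 + i * (ac_p + 2)
        cost += 1                 # evaluate the guard i < t
        if not i < t:
            break
        cost += ac_p              # P
        i += 1
        cost += 1                 # i++
    return cost

total = check_cost_invariant(t=7, ac_p=4)
```

On loop exit i equals t, so the invariant yields the overall bound 2 + t · (ac_P + 2), which the accumulated cost meets exactly in this accounting.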


Relational Cost Analysis. AE allows specifying and verifying relational program 
properties [37], because one can express rule schemata. This extends to QAE 
and makes it possible, for the first time, to infer and to prove (automatically), 
for example, the impact of program transformation on performance. 


Certification. Cost annotations inferred by abstract cost analysis, i.e., cost in- 
variants and abstract cost bounds, are automatically certified by a deductive ver- 
ification system, extending the approach reported in [4] to abstract cost and ab- 
stract programs. This is possible because the specification (i.e., the cost bound) 
and the loop (cost) invariants are inferred by the cost analyzer—the verification 
system does not need to generate them. 

To argue correctness of an abstract cost analysis is complex, because it must 
be valid for an infinite set of concrete programs. For this reason alone, it is 
useful to certify the abstract cost inferred for a given abstract program: during 
development of the abstract cost analysis reported here, several errors in abstract 
cost computation were detected—analysis of the failed verification attempt gave 
immediate feedback on the cause. We built a test suite of problems so that any 
change in the cost analyzer can be validated in the future. 

Certification is crucial for the correctness of quantitative relational prop- 
erties: The inferred cost invariants might not be precise enough to establish, 
e.g., that a program transformation does not increase cost for any possible pro- 
gram instance and run. This is only established at the certification stage, where 
relational properties are formally verified. A relational setting requires provably 
precise cost bounds. This feature is not offered by existing cost analysis methods. 


2 QAE by Example 


We introduce our approach and terminology informally by means of a motivat- 
ing example: Code Motion [1] is a compiler optimization technique moving a 
statement not affected by a loop from the beginning of the loop body to before 
the loop. This code transformation should preserve behavior provided the loop 
is executed at least once, but can be expected to improve computation effort, 
i.e., quantitative properties of the program, such as execution time and memory
consumption: The moved code block is executed just once in the transformed
context, leading to fewer instructions (less energy consumed) and, in case it allocates
memory, less memory usage. In the following we subsume any quantitative
aspect of a program under the term cost expressed in an unspecified cost model
with the understanding that it can be instantiated to specific cost measures, such
as number of instructions, number of allocated bytes, energy consumed, etc.


Program Before

    int i = 0;
    //@ loop_invariant i >= 0 && i <= t;
    //@ cost_invariant
    //@   i * (ac_P(t, w) + ac_Q(t, z) + 2);
    //@ decreases t - i;
    while (i < t) {
      //@ assignable x;
      //@ accessible t, w;
      //@ cost_footprint t, w;
      \abstract_statement P;
      //@ assignable y;
      //@ accessible i, t, y, z;
      //@ cost_footprint t, z;
      \abstract_statement Q;
      i++;
    }
    //@ assert \cost == 2 +
    //@   t * (ac_P(t, w) + ac_Q(t, z) + 2);

Program After

    int i = 0;
    //@ assignable x;
    //@ accessible t, w;
    //@ cost_footprint t, w;
    \abstract_statement P;

    //@ loop_invariant i >= 0 && i <= t;
    //@ cost_invariant
    //@   i * (ac_Q(t, z) + 2);
    //@ decreases t - i;
    while (i < t) {
      //@ assignable y;
      //@ accessible i, t, y, z;
      //@ cost_footprint t, z;
      \abstract_statement Q;
      i++;
    }
    //@ assert \cost == 2 +
    //@   ac_P(t, w) + t * (ac_Q(t, z) + 2);

Preconditions and Postconditions

    Inputs: t, w, x, y, z    Precondition: t > 0    Postcondition: \cost_1 >= \cost_2

Fig. 1: Motivating example on relational quantitative properties.


To formalize code motion as a transformation rule, we describe in- and out- 
put of the transformation schematically. Fig. 1 depicts such a schema in a lan- 
guage based on JAVA. An Abstract Statement (AS) with identifier Id, declared 
as “\abstract_statement Id;”, represents an arbitrary concrete statement. It is 
obviously unsafe to extract arbitrary, possibly non-invariant, code blocks from 
loops. For this reason, the AS P in question has a specification restricting the 
allowed behavior of its instances. For compatibility with JAVA we base our spec- 
ification language on the Java Modeling Language (JML) [27]. Specifications are 
attached to code via structured comments that are marked as JML by an “@” 
symbol. JML keyword “assignable” defines the memory locations that may oc- 
cur in the frame of an AS; similarly, “accessible” restricts the footprint. Fig. 1 
contains further keywords explained below. 


Input to QAE is the abstract program to analyze, including annotations 
(highlighted in light gray in Fig. 1) that express restrictions on the permitted 
instances of ASs. In addition to the frame and footprint, the cost footprint of an 
AS, denoted with the keyword “cost_footprint”, is a subset of its footprint listing
locations the cost expressions in AS instances may depend on. In Fig. 1, the cost
footprint of AS Q excludes accessible variables i and y. Annotations highlighted
in dark gray are automatically inferred by abstract cost analysis and are input
for the certifier. As usual, loop invariants (keyword “loop_invariant”) are needed
to describe the behavior of loops with symbolic bounds. The loop invariant in
Fig. 1 allows inferring the final value t of loop counter i after loop termination.
To prove termination, the loop variant (keyword “decreases”) is inferred.

So far, this is standard automated cost analysis [3]. The ability to infer
the remaining annotations automatically represents our main contribution: Each
AS P has an associated abstract cost function parametric in the locations of its
footprint, represented by an abstract cost symbol ac_P. The symbol ac_P(t, w) in
the “assert” statement in Fig. 1 can be instantiated with any concrete function
parametric in t, w being a valid cost bound for the instance of P. For example,
for the instantiation “P ≡ x=t+1;” the constant function ac_P(t, w) = 1 is the
correct exact cost, while ac_P(t, w) = t with t > 1 is a correct upper bound cost.

As pointed out in Sect. 1 we require cost invariants to capture the cost of each 
loop iteration. They are declared by the keyword “cost_invariant”. To generate 
them, it is necessary to infer the cost growth of abstract programs that bounds 
the number of loop iterations executed so far. In Sect. 4 we describe automated 
inference of cost invariants including the generation of cost growth for all loops. 
Our technique is compositional and also works in the presence of nested loops. 

The QAE framework can express and prove quantitative relational properties. 
The assertions in the last lines in Fig. 1 use the expression \cost referring to the 
total accumulated cost of the program, i.e., the quantitative postcondition. We 
support quantitative relational postconditions such as \cost_1 >= \cost_2, where
\cost_1, \cost_2 refer to the total cost of the original (on the left) and transformed
(on the right) program, respectively. To prove relational properties, one
must be able to deduce exact cost invariants for loops such that the comparison 
of the invariants allows concluding that the programs from which the invariants 
are obtained fulfill the proven relational property. Otherwise, over-approximation 
introduced by cost analysis could make the relation for the postconditions hold, 
while the relational property does not necessarily hold for the programs. 
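
For the two programs in Fig. 1, the relational postcondition can be retraced by hand from the two asserted cost expressions. The following derivation is only a sketch; it assumes that abstract costs are non-negative (as required of each abstract cost in Def. 3) and that t is an integer with t > 0:

```latex
\begin{aligned}
\mathit{cost}_1 - \mathit{cost}_2
  &= \bigl(2 + t\cdot(ac_P(t,w) + ac_Q(t,z) + 2)\bigr)
     - \bigl(2 + ac_P(t,w) + t\cdot(ac_Q(t,z) + 2)\bigr) \\
  &= (t - 1)\cdot ac_P(t,w) \;\ge\; 0 .
\end{aligned}
```

The difference vanishes when t = 1 or when the instance of P has cost 0. The argument goes through only because both cost expressions are exact.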

To obtain a formal account of QAE with correctness guarantees we require a 
mathematically rigorous semantic foundation of abstract cost. This is provided 
in the following section. 


3 (Quantitative) Abstract Execution 


Abstract Execution [37,38] extends symbolic execution by permitting abstract 
statements to occur in programs. Thus AE reasons about an infinite set of 
concrete programs. An abstract program contains at least one AS. The semantics 
of an AS is given by the set of concrete programs it represents, its set of legal 
instances. To simplify presentation, we only consider normally completing JAVA 
code as instances: an instance may not throw an exception, break from a loop, 
etc. Each AS has an identifier and a specification consisting of its frame and 
footprint. Semantically, instances of an AS with identifier P may at most write 
to memory locations specified in P’s frame and may only read the values of 
locations in its footprint. All occurrences of an AS with the same identifier 
symbol have the same legal instances (possibly modulo renaming of variables, 
if variable names in frame and footprint specifications differ). For example, by

    //@ assignable x, y;
    //@ accessible y, z;
    \abstract_statement P;

we declare an AS with identifier “P”, which can be instantiated by programs 
that write at most to variables x and y, while only depending on variables y 
and z. The program “x=y; y=17;” is a legal instance of it, but not “x=y; y=w;”, 
which accesses the value of variable w not contained in the footprint. 
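
The two instances above can be checked mechanically with a purely syntactic approximation. The sketch below is an illustration only: the class `FrameFootprintCheck` and its `Assignment` model are hypothetical and not part of AE, which checks legality semantically in a program logic.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper: conservative syntactic frame/footprint check for straight-line assignments. */
final class FrameFootprintCheck {

    /** Models one assignment "target = expression over reads"; illustrative only. */
    static final class Assignment {
        final String target;
        final Set<String> reads;

        Assignment(String target, Set<String> reads) {
            this.target = target;
            this.reads = reads;
        }
    }

    /** True if every written variable is in the frame and every read variable is in the footprint. */
    static boolean isLegalInstance(List<Assignment> body, Set<String> frame, Set<String> footprint) {
        for (Assignment a : body) {
            if (!frame.contains(a.target) || !footprint.containsAll(a.reads)) {
                return false;
            }
        }
        return true;
    }

    /** "x=y; y=17;" checked against frame {x, y} and footprint {y, z}. */
    static boolean paperLegalExample() {
        List<Assignment> body = List.of(
                new Assignment("x", Set.of("y")),
                new Assignment("y", Set.<String>of()));
        return isLegalInstance(body, Set.of("x", "y"), Set.of("y", "z"));
    }

    /** "x=y; y=w;" reads w, which is outside the footprint {y, z}. */
    static boolean paperIllegalExample() {
        List<Assignment> body = List.of(
                new Assignment("x", Set.of("y")),
                new Assignment("y", Set.of("w")));
        return isLegalInstance(body, Set.of("x", "y"), Set.of("y", "z"));
    }

    public static void main(String[] args) {
        System.out.println(paperLegalExample());
        System.out.println(paperIllegalExample());
    }
}
```

A syntactic check of this kind may reject instances that are semantically legal, so it should be read as a sufficient test only.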

We use the shorthand P(x, y :≈ y, z) for the AS declaration above. The left-hand
side of “:≈” is the frame, the right-hand side the footprint. Abstract programs
allow expressing a second-order property such as “all programs assigning
at most x, y while reading at most y, z leave the value of i unchanged”. In Hoare
triple format (where i₀ is a fresh constant not occurring in P):


    {i = i₀} P(x, y :≈ y, z); {i = i₀}        (*)


3.1 Abstract Execution with Abstract Cost 


We extend the AE framework [37,38] to QAE by adding cost specifications that 
extend the specification of an AS with an annotated cost expression. An abstract 
cost expression is a function whose value may depend on any memory location in 
the footprint of the AS it specifies. This location set is called the cost footprint, 
specified via the cost_footprint keyword (see Fig. 1), and must be a subset of the 
footprint of the specified AS. The cost footprint for the program in (*) might be
declared as “{z}”. It implicitly declares the abstract function ac_P(z) that could
be instantiated to, say, quadratic cost “z²”.


Definition 1 (Abstract Program). A pair 𝒫 = (abstrStmts, p_abstr) of a set
of AS declarations abstrStmts ≠ ∅ and a program fragment p_abstr containing
exactly those ASs is called an abstract program. Each AS declaration in abstrStmts
is a pair (P(frame :≈ footprint), ac_P(costFootprint)), where P is an identifier;
frame, footprint, and costFootprint ⊆ footprint are location sets.

A concrete program fragment p is a legal instance of 𝒫 if it arises from substituting
concrete cost functions for all ac_P in abstrStmts, and concrete statements
for all P in abstrStmts, where (i) all ASs are instantiated legally, i.e., by
statements respecting their frame, footprint, and cost function, and (ii) all ASs
with the same identifier are instantiated with the same concrete program. The
semantics ⟦𝒫⟧ consists of all its legal instances.


The abstract program consisting of only AS P in (*) with cost footprint “{z}”
is formally defined as: ({(P(x, y :≈ y, z), ac_P(z))}, P;). The program “p₀ ≡
i=0; while (i < z) {x = z; i++;}” with cost function “ac_P(z) = 3·z + 2” is a
legal instance: it respects frame, footprint, and cost footprint, as well as the cost
function, that (assuming z > 0) can be obtained by static cost analysis of p₀.
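
The cost function of this instance can be retraced by counting instructions. The derivation below is a sketch under the assumption that every assignment, guard evaluation, and increment costs one unit (the instruction-count model introduced in Sect. 3.2) and that the loop body runs z times:

```latex
\underbrace{1}_{\texttt{i=0}}
\;+\; \underbrace{(z+1)}_{\text{guard evaluations}}
\;+\; \underbrace{z}_{\texttt{x=z}}
\;+\; \underbrace{z}_{\texttt{i++}}
\;=\; 3\cdot z + 2
```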

By encoding the semantics of abstract programs in a program logic [38, Sect. 
4.2] one can statically verify whether an instance is legal. It may require auxiliary 
specifications (invariants, contracts) of the concrete code. The property is unde- 
cidable, but can be proven automatically in many cases, see [38] for a discussion. 
A first implementation of such a check is part of the REFINITY tool (see [36], 
also https://www.key-project.org/REFINITY/).


3.2 Cost of Abstract Programs


Finitely executing a concrete program p starting in a state s₀ = (p, σ₀) with an
initial assignment σ₀ of p’s program variables results in a finite trace of the form
t = s₀ →[c_1] ... →[c_n] s_n. Each state s_i = (p_i, σ_i) consists of a program counter p_i
(the remaining program to execute) and a store σ_i (the current variable assignment);
each transition s_i →[c_{i+1}] s_{i+1} updates s_i to s_{i+1} according to the effect of
executing command c_{i+1} defined in the semantics of the programming language.
A complete trace corresponds to a terminating execution, i.e., s_n = (ε, σ_n), where
ε is the empty program and σ_n the resulting final variable assignment.

The cost of a program can be computed based on execution traces. To al- 
low arbitrary quantitative properties, we work on a generic cost model M that 
assigns cost values to programming language instructions. We will compute the 
cost of a trace t, denoted M(t), by summing up the costs of the executed instructions.
A straightforward measure is the number of executed instructions
M_instr: In this cost model, instructions like “x=1;”, the evaluation of the loop
guard, etc., all are assigned cost 1. For example, the cost of the complete trace
of “while (i > 0) i--;” when started with an initial store assigning the value 3
to i is 7, because “i--;” is executed three times and the guard is evaluated four
times. This can be generalized to symbolic execution: Executing the same program
with a symbolic store assigning to i a symbolic initial value i₀ > 0 produces
traces of cost 2·i₀ + 1. The cost of abstract programs, i.e., the generalization to
QAE, is defined similarly: By generalizing not merely over all initial stores, but 
also over all concrete instances of the abstract program. 
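
The trace cost of the small loop above can be reproduced by instrumenting it with a counter. This is a minimal sketch, assuming the instruction-count model in which each guard evaluation and each decrement costs 1; the class and method names are made up for illustration.

```java
/** Illustrative instruction counter for "while (i > 0) i--;" under the instruction-count model. */
final class InstrCountDemo {

    /** Runs the loop from i = i0 and returns the number of executed instructions. */
    static long traceCost(long i0) {
        long cost = 0;
        long i = i0;
        while (true) {
            cost++;              // one evaluation of the guard i > 0
            if (!(i > 0)) {
                break;
            }
            i--;
            cost++;              // one execution of i--;
        }
        return cost;
    }

    public static void main(String[] args) {
        for (long i0 = 0; i0 <= 5; i0++) {
            // measured cost next to the closed form 2 * i0 + 1
            System.out.println(i0 + ": " + traceCost(i0) + " vs " + (2 * i0 + 1));
        }
    }
}
```

For the initial value 3 the counter yields 7, matching the three decrements and four guard evaluations described in the text.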


Definition 2 (Abstract Program Cost). Let M be a cost model. Let an
integer-valued expression c_𝒫 consist of scalar constants, program variables, and
abstract cost symbols applied to constants and variables. Expression c_𝒫 is the
cost of an abstract program 𝒫 w.r.t. M if for all concrete stores σ and instances
p ∈ ⟦𝒫⟧ such that p terminates with a complete trace t of cost M(t) when
executed in σ, c_𝒫 evaluates to M(t) when interpreting variables according to σ,
and abstract cost functions according to the instantiation step leading to p. The
instance of c_𝒫 using the concrete store σ is denoted c_𝒫(σ).


Example 1. We test the cost assertion in the last lines of the left program in
Fig. 1 by computing the cost of a trace obtained from a fixed initial store and
instances of P, Q. We use the cost model M_instr and an initial store that assigns
2 to t and 0 to all other variables. We instantiate P with “x=2*t;” and Q with
“y=i; y++;”. Consequently, the abstract cost functions ac_P(t, w) and ac_Q(t, z)
are instantiated with 1 and 2, respectively. Evaluating the postulated abstract
program cost 2 + t·(2 + ac_P(t, w) + ac_Q(t, z)) for the concrete store and AS
instantiations results in 2 + 2·(2 + 1 + 2) = 12. Consequently, the execution trace
should contain 12 transitions, which is the case.
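
Example 1 can be replayed by executing the instantiated left program with an explicit instruction counter. The following is a sketch under the instruction-count model with the instantiations named in the example; the class `Example1Trace` and its methods are hypothetical names introduced here.

```java
/** Illustrative simulation of the left program of Fig. 1 with P = "x=2*t;" and Q = "y=i; y++;". */
final class Example1Trace {

    /** Executes the instantiated program and returns the number of executed instructions. */
    static long runBefore(long t) {
        long cost = 0;
        long x = 0, y = 0;
        long i = 0;
        cost++;                      // int i = 0;
        while (true) {
            cost++;                  // evaluation of the guard i < t
            if (!(i < t)) {
                break;
            }
            x = 2 * t; cost++;       // instance of P: x = 2*t;
            y = i;     cost++;       // instance of Q, first instruction: y = i;
            y++;       cost++;       // instance of Q, second instruction: y++;
            i++;       cost++;       // loop counter increment
        }
        return cost;
    }

    /** The postulated abstract cost 2 + t * (2 + ac_P + ac_Q) for concrete cost values. */
    static long postulatedCost(long t, long acP, long acQ) {
        return 2 + t * (2 + acP + acQ);
    }

    public static void main(String[] args) {
        System.out.println(runBefore(2));            // measured trace cost
        System.out.println(postulatedCost(2, 1, 2)); // value of the cost expression
    }
}
```

Both calls yield 12 for t = 2, in line with the example.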


3.3 Proving Quantitative Properties with QAE 


There are two ways to realize QAE on top of the existing functional verification 
layer provided by the AE framework [37,38]: (i) provide a “cost” extension
to the program logic and calculus underlying AE; (ii) translate non-functional
(cost) properties to functional ones. We opt for the second, as it is less prone to 
introduce soundness issues stemming from the addition of new concepts to the 
existing framework. It is also faster to realize and allows early testing. 

The translation consists of three elements: (a) A global “ghost” variable 
“cost” (representing keyword “\cost”) for tracking accumulated cost; (b) explicit 
encoding of a chosen cost model by suitable ghost setter methods that update this 
variable; (c) functional loop invariants and method postconditions expressing 
cost invariants and cost postconditions. 

Regarding item (c), we support three kinds of cost specification. These are, 
descending in the order of their strength: exact, upper bound, and asymptotic 
cost. At the analysis stage, it is usually impossible to determine the best match. 
For this reason, there is merely one cost_invariant keyword, not three. However, 
when translating cost to functional properties, a decision has to be made. A 
natural strategy is to start with the strongest kind of specification, then proceed 
towards the weaker ones when a proof fails. 

An exact cost invariant has the shape “cost == expr”, an upper bound
on the invariant cost is specified by “cost <= expr”; asymptotic cost is expressed
by the idiom “asymptotic(cost) <= asymptotic(expr)”. The function
“asymptotic” abstracts from constant symbols in the argument. For example,
the (exact) cost postcondition of the abstract program on the right in Fig. 1 is:

    cost == 2 + ac_P(t, w) + t * (ac_Q(t, z) + 2)        (†)

Asymptotic cost would be expressed as asymptotic(cost) <= asymptotic(2 +
ac_P(t, w) + t * (ac_Q(t, z) + 2)) where the right-hand side of the inequality is
equivalent to asymptotic(ac_P(t, w) + t * ac_Q(t, z)).
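
The strategy of starting with the strongest kind of specification and weakening it when a proof fails can be sketched as a simple loop. Everything in this block is an assumption made for illustration: the class `CostSpecFallback`, the enum `Kind`, and the `prover` predicate (a stand-in for a call to the verification system) are hypothetical and do not describe the actual implementation.

```java
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical sketch of the "strongest specification first" strategy. */
final class CostSpecFallback {

    /** Kinds of cost specification, declared from strongest to weakest. */
    enum Kind { EXACT, UPPER_BOUND, ASYMPTOTIC }

    /** Renders the functional property for a given kind and cost expression. */
    static String render(Kind kind, String expr) {
        switch (kind) {
            case EXACT:
                return "cost == " + expr;
            case UPPER_BOUND:
                return "cost <= " + expr;
            default:
                return "asymptotic(cost) <= asymptotic(" + expr + ")";
        }
    }

    /** Returns the first (strongest) kind whose rendered property the prover accepts. */
    static Optional<Kind> strongestProvable(String expr, Predicate<String> prover) {
        for (Kind kind : Kind.values()) {
            if (prover.test(render(kind, expr))) {
                return Optional.of(kind);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in prover that only accepts upper-bound properties.
        Predicate<String> prover = property -> property.startsWith("cost <=");
        System.out.println(strongestProvable("2 + t * 5", prover));
    }
}
```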

Listing 2 shows the result of translating the cost invariant in Fig. 1 to a 
functional loop invariant (highlighted lines), using cost model M_instr in ghost
setters and postconditions of AS (“ensures” clauses). ASs P, Q must include 
the ghost variable “cost” in their frame, because they update its value. The 
keyword \before in the postcondition of an AS refers to the value a variable 
had just before executing the AS. In loops we use “inner” cost variables “iCost” 
tracking the cost inside the loop. When the loop terminates, we add the final 
value of “iCost” to “cost”. After every evaluation of the guard of the loop, the 
cost is incremented accordingly. Using the translation in Listing 2 of the inferred 
annotations in Fig. 1, the AE system proves cost postcondition (†) automatically.


Apart from the translation of inferred quantitative annotations to functional 
AE specifications, we implemented the axiomatization of the asymptotic function 
and extended the AE system’s proof script language. This made it possible to 
define a highly automated proof strategy for non-linear arithmetic problems 
generated by some cost analysis benchmarks. 


4 Abstract Cost Analysis 


Recall from Sect. 2 that for automatic cost certification we need to infer annotations
for abstract cost invariants and cost postconditions. To achieve this, we
lift a cost analysis framework for concrete programs to the abstract setting.
The presentation is structured as follows: Sect. 4.1 defines the notion of an abstract
cost relation system (ACRS) used in cost analysis for the abstract setting.
Sect. 4.2 details how to generate automatically inductive cost invariants for abstract
programs from ACRSs. Sect. 4.3 tells how to generate cost postconditions
used to prove relational properties and required to handle nested loops.


    //@ ghost int cost = 0;
    int i = 0;
    //@ set cost = cost + 1;

    //@ assignable x, cost;
    //@ accessible t, w;
    //@ ensures cost == \before(cost)
    //@     + ac_P(t, w);
    \abstract_statement P;

    //@ ghost int iCost = 0;
    //@ loop_invariant i >= 0 && i <= t
    //@     && iCost == i * (ac_Q(t, z) + 2);
    //@ decreases t - i;
    while (i < t) {
      //@ set iCost = iCost + 1;
      //@ assignable y, cost;
      //@ accessible i, t, y, z;
      //@ ensures cost ==
      //@     \before(cost) + ac_Q(t, z);
      \abstract_statement Q;
      i++;
      //@ set iCost = iCost + 1;
    }
    //@ set cost = cost + 1;
    //@ set cost = cost + iCost;

Listing 2: Translation of cost model and cost invariants to AE.


4.1 Inference of Abstract Cost Relations 


There are two main cost analysis approaches: those using recurrence equations 
in the style of Wegbreit [39], and those based on type systems [14, 24]. Our 
formalization is based on the first kind, but the main ideas for extending the 
framework to abstract programs would also be applicable to the second. The key
issue when extending a recurrences-based framework to the abstract setting is 
the notion of abstract cost relation for loops which generalizes the concept of cost 
recurrence equations for a loop to an abstract setting. We start with notation 
for loops and technical details on assumed size relations. 


Loops. In our formalization we consider while-loops containing n abstract statements
and m non-abstract statements. Non-abstract statements include any
concrete instruction of the target language (arithmetic instructions, conditionals,
method calls, ...). We assume loops L have the general outline displayed
below. Each abstract statement has a frame specification; abstract
and non-abstract statements may appear in any order, and either might be empty.

    while (G) {
      //@ accessible r_{1,1}, ..., r_{1,h_{r_1}};
      //@ assignable w_{1,1}, ..., w_{1,h_{w_1}};
      //@ cost_footprint c_{1,1}, ..., c_{1,h_{c_1}};
      \abstract_statement A_1;
      non_abstract_statement N_1;
      ...
    }


Size relations. We assume that for each loop sets of size constraints have been 
computed. These sets capture the size relation among the variables in the loop 
upon exit (called base case, denoted φ_B), and when moving from one iteration to
the next (denoted φ_I). ASs are ignored by the size analysis. While this would be
unsound in general, it will be correct under the requirements we impose in Def. 4
and with the handling of ASs in Def. 3. Size relations are available from any cost
analyzer by means of a static analysis [13] that records the effect of concrete
program statements on variables and propagates it through each loop iteration.
In our examples, since we work on integer data, size analysis corresponds to a
value analysis [10] tracking the value of the integer variables.²


Example 2. The size relations for the loop on the left in Fig. 1 are φ_B = {i ≥ t}
and φ_I = {i < t, i′ = i + 1}. φ_B is inferred from the loop guard and φ_I from the
guard and the increment of i (primed variables refer to the value of the variable 
after the loop execution). 


Based on pre-computed size relations, we define the cost of executing a loop by 
means of an abstract cost relation system (ACRS). This is a set of cost equations 
characterizing the abstract cost of executing a loop for any input with respect 
to a given cost model M. Cost equations consist of a cost expression governed 
by size constraints containing applicability conditions for the equation (like i < t
in φ_I above) and size relations between loop variables (like i′ = i + 1 in φ_I).


Definition 3 (Abstract Cost Relation System). Let L be a loop as above
with n abstract and m non-abstract statements. Let x̄ be the set of variables
accessed in L. Let φ_I, φ_B be sound size relations for L, and M a cost model.
The ACRS for L is defined as the following set of cost equations:

\[
\begin{aligned}
C(\bar{x}) &= c_B, && \varphi_B \\
C(\bar{x}) &= \sum_{j=1}^{n} ac_j(c_{j,1}, \dots, c_{j,h_{c_j}}) + \sum_{i=1}^{m} c_{N_i} + C(\bar{x}'), && \varphi_I
\end{aligned}
\]

where:

1) c_B ≥ 0 is the cost of exiting the loop (executing the base case) w.r.t. M.

2) Each ac_j(·) ≥ 0 represents the abstract cost for the abstract statement A_j
in L w.r.t. M. Each ac_j is parameterized with the variables in the cost
footprint of the corresponding A_j, as it may depend on any of them.

3) Each c_{N_i} ≥ 0 is the cost of the non-abstract statement N_i w.r.t. M.

4) C is a recursive call.

5) x̄′ are variables x̄ when renamed after executing the loop.

6) The assignable variables w_{j,*} in the ac_j get an unknown value in x̄′ (denoted
with “_” in the examples below).


Ignoring the abstract statements, one can apply a complete algorithm for cost relation
systems [6] to an ACRS to obtain automatically a linear³ ranking function
f for loop L: f is a linear, non-negative function over x̄ that decreases strictly
at every loop iteration. Function f yields directly the “//@ decreases f;” annotation
required for QAE.

As in Sect. 3, the definition of ACRS assumes a generic cost model M and 
uses c to refer in a generic way to cost according to M. For example, to infer
the number of executed steps, c is set to 1 per instruction, while for memory
usage c records the amount of memory allocated by an instruction.


2 For complex data structures, one would need heap analyses [35] to infer size relations.
3 There exist (more expensive) algorithms to obtain also polynomial ranking functions
[5] but for the sake of efficiency we are not using them in our system.


General Case of ACRS. The definition of ACRS was simplified for presentation.
The following generalizations, not requiring any new concept, are possible:
(1) We assume an ACRS for a loop has only two equations, one for the base case 
(the guard G does not hold) and one for the iterative case (G holds). In general, 
there might be more than one equation for the base case, e.g., if the guard in- 
volves multiple conditions and the cost varies depending on the condition that 
holds on the exit. Similarly, there might be multiple equations in the iterative 
case, e.g., if the loop body contains conditional statements and each iteration 
has different cost depending on the taken branch. This issue is orthogonal to 
the extension to abstract cost. (2) A loop might contain method calls that in 
turn contain ASs. In absence of recursion, such calls can be inlined. For recur- 
sive methods, it is possible to compute the call graph and solve the equations 
in reverse topological order such that the abstract cost of the (inner) method 
calls is obtained first and then inserted into the surrounding equations. (3) The 
cost of code fragments not part of any loop (before, after, and in between loops) 
is defined as well by abstract cost equations accumulating the cost of all in- 
structions these fragments include, just as for concrete programs. This aspect 
does not require changes to the framework for concrete programs, so we do not 
formalize it, but just illustrate it in the next example. 


Example 3. The ACRSs of the programs in Fig. 1 are (left program in the first
three equations, right program in the last three):

\[
\begin{aligned}
C_{before}(t,x,w,y,z) &= c_{before} + C_{W_0}(i,t,x,w,y,z), && \{i = 0\} \\
C_{W_0}(i,t,x,w,y,z) &= c_{B_{W_0}}, && \{i \ge t\} \\
C_{W_0}(i,t,x,w,y,z) &= c_{W_0} + ac_P(t,w) + ac_Q(t,z) + C_{W_0}(i',t,\_,w,\_,z), && \{i' = i+1,\ i < t\} \\[1ex]
C_{after}(t,x,w,y,z) &= c_{after} + ac_P(t,w) + C_{W_1}(i,t,\_,w,y,z), && \{i = 0\} \\
C_{W_1}(i,t,x,w,y,z) &= c_{B_{W_1}}, && \{i \ge t\} \\
C_{W_1}(i,t,x,w,y,z) &= c_{W_1} + ac_Q(t,z) + C_{W_1}(i',t,x,w,\_,z), && \{i' = i+1,\ i < t\}
\end{aligned}
\]

Notation c refers to the generic cost that can be instantiated to a chosen cost
model M. Cost equation C_before for the first program is composed of the cost
c_before of the instructions appearing before the loop plus the cost of executing
the while loop, C_W0. The size constraint fixes the initial value of i. Following
Def. 3, there are two equations corresponding to the base case of the loop and
executing one iteration, respectively. Observe that assignable variables in ASs
have unknown values in the ACRS (according to item (6) in Def. 3). Program after
has a similar structure. A ranking function for both loops is t - i, which is used
to generate the annotation “//@ decreases t - i;” inserted just before each loop in Fig. 1.


To guarantee soundness of abstract cost analysis, it is mandatory that (i) no 
AS in the loop modifies any of the variables that influence loop cost, i.e., they 
do not interfere with cost, and (ii) the cost of the AS in the loop is indepen- 
dent of the variables modified in the loop. We call the latter ASs cost neutral. 
The first requirement is guaranteed by item (6) in Def. 3, because the value of 
assignable variables is “forgotten” in the equations. It is implemented, as usual in 
static analysis, by using a name generator for fresh variables. If cost depends on
assignable variables in an AS, then the ACRS will not be solvable (i.e., the analysis
returns “unbound cost”). The ACRS in the example contains “_” in equations
that do not prevent solvability of the system nor its evaluation, because they 
do not interfere with cost. However, if we had “forgotten” a cost-relevant vari- 
able (such as t), we would be unable to solve or evaluate the equations: without 
knowing t the equation guard is not evaluable. Requirement (ii) is ensured by the 
following definition ensuring that variables in the cost footprint are not modified 
by other statements in the loop. 


Definition 4 (Cost neutral AS). Given a loop L, where

– W(L) is the set of variables written by the non-abstract statements of L.
– Abstr(L) is the set of all ASs in loop L.
– Frame(Abstr(L)) is the set of variables assigned by any AS A ∈ Abstr(L).
– CostFootprint(A) is the set of variables which the cost of an A depends on.

L is a loop with cost neutral ASs if, for all A ∈ Abstr(L), it is the case that
(W(L) ∪ Frame(Abstr(L))) ∩ CostFootprint(A) = ∅.


The definition above constitutes a sufficient, but not necessary criterion that 
could be tightened by a more expensive analysis. For instance, our framework 
easily extends to allow conditions in the cost footprint that the concretizations 
of the AS must fulfill. In our example, the cost footprint might include condition 
i′ ≥ i, where i′ is the value of i after executing the AS. This permits the abstract
statement to modify i provided it does not decrease its value. Thus, the AS is 
not cost neutral, but the upper bound remains sound. The formalization of this 
generalization is left to future work. 


Example 4. It is easy to check that both loops in Fig. 1 have cost neutral ASs. On 
the left: W(L) = {i}, Frame({P, Q}) = {x, y}, CostFootprint(P) = {t, w}, and
CostFootprint(Q) = {t, z}, so (W(L) ∪ Frame({P, Q})) ∩ CostFootprint(P) = ∅,
and (W(L) ∪ Frame({P, Q})) ∩ CostFootprint(Q) = ∅. The program on the right
is checked analogously. 
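
The cost-neutrality condition is a plain disjointness test on variable sets, so the check of Example 4 can be sketched in a few lines. The class `CostNeutralCheck` and the variant in which an AS may write t are hypothetical and only serve as an illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative check of the cost-neutrality condition of Def. 4 on explicit variable sets. */
final class CostNeutralCheck {

    /**
     * True if (W(L) union Frame(Abstr(L))) is disjoint from CostFootprint(A)
     * for every abstract statement A of the loop.
     */
    static boolean isCostNeutral(Set<String> writtenByConcrete,
                                 Set<String> frameOfAllAS,
                                 List<Set<String>> costFootprints) {
        Set<String> modified = new HashSet<>(writtenByConcrete);
        modified.addAll(frameOfAllAS);
        for (Set<String> costFootprint : costFootprints) {
            for (String variable : costFootprint) {
                if (modified.contains(variable)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Left loop of Fig. 1: W(L) = {i}, Frame = {x, y}, cost footprints {t, w} and {t, z}. */
    static boolean leftLoopOfFig1() {
        return isCostNeutral(Set.of("i"), Set.of("x", "y"),
                List.of(Set.of("t", "w"), Set.of("t", "z")));
    }

    /** Hypothetical variant in which some AS may also write t, which both costs depend on. */
    static boolean variantWritingT() {
        return isCostNeutral(Set.of("i"), Set.of("x", "y", "t"),
                List.of(Set.of("t", "w"), Set.of("t", "z")));
    }

    public static void main(String[] args) {
        System.out.println(leftLoopOfFig1());
        System.out.println(variantWritingT());
    }
}
```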


Given a program P with variables x̄ and an ACRS with initial equation C_ini(x̄),
we denote by eval(C_ini(x̄), σ₀) the evaluation of the ACRS for a given initial
assignment σ₀ of the variables. This is a standard evaluation of recurrence equations
performed by instantiating the right-hand side of the equations with the
values of the variables in σ₀ and checking the satisfiability of the size constraints
(if the expression being checked or accumulated contains “_”, the evaluation returns
“unbound”). As usual, the process is repeated until an equation without
calls is reached.


Example 5. Consider the ACRS of the left program in Fig. 1 with variables
(t, x, w, y, z), initial state σ₀ = (2, 0, 0, 0, 0), and cost model M_instr (thus c_before,
c_{B_W0}, and c_W0 take values 1, 1 and 2 respectively). The evaluation of the ACRS
results in eval(C_ini(t, x, w, y, z), (2, 0, 0, 0, 0)) = 6 + 2·ac_P(2, 0) + 2·ac_Q(2, 0).
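
The evaluation in Example 5 amounts to unfolding the recurrence until the base case applies. The sketch below hard-codes the three equations of the left program with the instruction-count constants from the example. It treats ac_P and ac_Q as fixed numbers, which is an assumption justified here only because both ASs are cost neutral; the class `AcrsEvalDemo` is a hypothetical name.

```java
/** Illustrative evaluation of the ACRS of the left program in Fig. 1 for concrete abstract costs. */
final class AcrsEvalDemo {

    // Cost constants under the instruction-count model, as in Example 5.
    static final long C_BEFORE = 1;  // int i = 0;
    static final long C_BASE = 1;    // final guard evaluation when the loop exits
    static final long C_ITER = 2;    // guard evaluation plus i++ in one iteration

    /** Loop equations: base case if i >= t, otherwise one iteration plus the recursive call. */
    static long evalLoop(long i, long t, long acP, long acQ) {
        if (i >= t) {
            return C_BASE;
        }
        return C_ITER + acP + acQ + evalLoop(i + 1, t, acP, acQ);
    }

    /** Initial equation: cost before the loop plus the loop cost, with size constraint i = 0. */
    static long evalBefore(long t, long acP, long acQ) {
        return C_BEFORE + evalLoop(0, t, acP, acQ);
    }

    public static void main(String[] args) {
        // With t = 2 the result has the shape 6 + 2 * acP + 2 * acQ.
        System.out.println(evalBefore(2, 0, 0));
        System.out.println(evalBefore(2, 1, 2));
    }
}
```

With the instantiations of Example 1 (ac_P = 1, ac_Q = 2), the evaluation gives 12, the same value as the measured trace cost.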


The following theorem states soundness of the ACRS obtained by applying Def. 3 
provided that all loops satisfy Def. 4.


Theorem 1 (Soundness of ACRS). Let M be a cost model and 𝒫 an abstract
program whose loops satisfy Def. 4. Let c_𝒫 be the abstract cost of 𝒫
defined as in Definition 2. Let C_ini be the initial equation for the ACRS obtained
by Def. 3. For any initial state of the variables σ₀ ∈ ℤⁿ, it holds that
c_𝒫(σ₀) ≤ eval(C_ini(x̄), σ₀).


4.2 From ACRS to Abstract Cost Invariants 


Example 5 shows that ACRSs are evaluable for concrete instances. However, 
to enable automated QAE, we need to obtain from them closed-form cost invariants
and postconditions, i.e., non-recursive expressions. We introduce the
novel concept of abstract cost invariant (ACI) that enables automated, inductive
proofs over cost in a deductive verification system. The crucial difference to
(non-inductive) cost postconditions as inferred by existing cost analyzers is that 
ACIs can be proven inductively for each loop iteration. Hence, they integrate 
naturally into deductive verification systems that use loop invariants [21]. 

In contrast to ACIs, postconditions provide a bound for the cost after execution 
of the whole loop they refer to. Typically, a postcondition bound for a 
loop has the form max_iter * max_cost + max_base, where max_iter is the maximal 
number of iterations of the loop, max_cost is the maximal cost of any loop 
iteration, and max_base is the maximal cost of executing the loop with no 
iterations. Instead, an ACI has the form growth * max_cost + max_base, where growth 
counts how many times the loop has been executed and hence provides a bound 
after each loop iteration. The challenge is to design an automated technique that 
infers growth. We propose to obtain it from the ranking function: 

Definition 5 (Growth). Given a loop with ranking function $F = c + \sum_i a_i \cdot v_i$, 
where $c$ and $v_i$ are the constant and variable parts of the function, respectively, 
and $a_i$ are constant coefficients, let $v_i^0$ denote the initial value of variable 
$v_i$ before entering the loop. Then $\mathit{growth} = \sum_i a_i \cdot (v_i^0 - v_i)$. 


Example 6. We look at four simple loops with their ranking functions (decreases 
clauses) and the growth inferred automatically by applying Def. 5: 


int i = 0;          int i = t;          int i = 0;             int i = t; 
while (i < t)       while (i > 0)       while (i < t)          while (i > 0) 
  i++;                i--;                i += 2;                i -= 2; 
decreases t - i     decreases i         decreases (t - i)/2    decreases i/2 
growth i            growth t - i        growth i/2             growth (t - i)/2 
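
Def. 5 can be checked on concrete runs. The following minimal sketch computes growth from the coefficients of a linear ranking function and compares it with the number of iterations executed so far for the first and fourth loop of Example 6. The helper names (`growth`, `trace_growth`, `inc`) and the concrete values of t are our own illustrative choices and not part of the approach's implementation.

```python
from fractions import Fraction

def growth(coeffs, initial, current):
    # Def. 5: growth = sum_i a_i * (v_i^0 - v_i); the constant c cancels out.
    return sum(a * (initial[v] - current[v]) for v, a in coeffs.items())

def trace_growth(coeffs, initial, guard, step):
    # Run the loop concretely and record (iterations so far, growth) pairs.
    state, pairs, n = dict(initial), [], 0
    while guard(state):
        step(state)
        n += 1
        pairs.append((n, growth(coeffs, initial, state)))
    return pairs

def inc(k):
    # Loop body that adds k to the counter i.
    def step(s):
        s["i"] += k
    return step

# Loop 1: i = 0; while (i < t) i++;      decreases t - i
loop1 = trace_growth({"t": 1, "i": -1}, {"t": 4, "i": 0},
                     lambda s: s["i"] < s["t"], inc(1))

# Loop 4: i = t; while (i > 0) i -= 2;   decreases i/2
loop4 = trace_growth({"i": Fraction(1, 2)}, {"t": 6, "i": 6},
                     lambda s: s["i"] > 0, inc(-2))

print(loop1)
print(loop4)
```

In both traces the inferred growth coincides with the iteration count, which is the property that makes growth usable as the inductive counterpart of max_iter.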


We can now define the concept of ACI that relies on abstract cost relations 
defined in Sect. 4.1 and growth as defined above. 


Definition 6 (Abstract Cost Invariant). Given an ACRS as in Def. 3 
and its growth as in Def. 5, an abstract cost invariant is defined as follows: 


$cinv(\bar{x}) = C_B^{\max} + \mathit{growth} \cdot \big( \sum_j ac_j(x_{j,1}, \ldots, x_{j,h_j}) + C_W^{\max} \big)$, 

where $C_B^{\max}$ stands for the maximal value that the expression $C_B$ can take 
under the constraints $\varphi_B$, and $C_W^{\max}$ for the maximal value of $C_W$ 
under $\varphi_W$. We generate the annotation 
“//@ cost_invariant cinv(x̄);”. 

To obtain the maximal cost of a cost expression under a set of constraints, 
we use existing maximization procedures [5]. 

From Def. 6 we obtain ACIs as closed-form abstract cost expressions of the 
form abexpr = cexpr | ac | abexpr_1 + abexpr_2 | abexpr_1 * abexpr_2, where 
ac represents an abstract cost function as defined in Sect. 3.1 and cexpr is a 
concrete cost expression. The definition above yields linear bounds; however, the 
extension to infer postconditions in the subsequent section leads to polynomial 
expressions (of arbitrary degree).* 


Example 7 (Abstract Cost Invariant). Consider the first loop in Example 6 
(where growth = i) with the following frame and footprint: 

//@ assignable j; accessible i,t,j,k; cost_footprint k; 

Using M_instr, the evaluation of the loop guard and the increase of i both have 
unit cost, so the ACRS is: 

C(i,t,j,k) = 1                             {i ≥ t} 
C(i,t,j,k) = ac_P(k) + 2 + C(i',t,_,k)     {i' = i+1, i < t} 

The value of the assignable variable j in the recursive call is “forgotten” (item (6) 
in Def. 3), but this information loss does not affect solvability of the ACRS. We 
obtain the following ACI: “//@ cost_invariant 1 + i * (2 + ac_P(k));”. 
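
To make the relation between the ACRS and the ACI of Example 7 concrete, the following sketch evaluates the recurrence by unrolling it and checks the invariant after every iteration. The instantiation of ac_P is hypothetical (any non-negative function of the cost footprint {k} would do), the loop is assumed to start with i = 0 as in Example 6, and the function names are ours. The forgotten variable j is omitted because it influences neither the guard nor the cost.

```python
def ac_P(k):
    # Hypothetical instantiation of the abstract cost of P (an assumption).
    return 3 * k + 1

def eval_acrs(i, t, k):
    # C(i,t,j,k) = 1                            {i >= t}
    # C(i,t,j,k) = ac_P(k) + 2 + C(i',t,_,k)    {i' = i+1, i < t}
    cost = 0
    while i < t:
        cost += ac_P(k) + 2
        i += 1
    return cost + 1

def cinv(i, k):
    # ACI from Example 7: 1 + i * (2 + ac_P(k))
    return 1 + i * (2 + ac_P(k))

def check(t, k):
    # The cost accumulated after each iteration must not exceed the invariant.
    i, cost = 0, 0
    while i < t:
        cost += ac_P(k) + 2   # guard, increment, and abstract statement P
        i += 1
        if cost > cinv(i, k):
            return False
    # On exit the invariant coincides with the full ACRS evaluation.
    return cinv(i, k) == eval_acrs(0, t, k)

print(eval_acrs(0, 5, 2), check(5, 2))
```

For this loop the invariant is exact on exit; Example 8 shows a case where it is only an upper bound.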


Example 8 (Upper Bound Abstract Cost Invariant). Sometimes an ACI 
over-approximates the cost, resulting in an upper bound ACI. To illustrate this, we 
add an instruction that creates an array of non-constant size to the program in 
Example 7 and measure memory consumption instead of instruction count: 

while (i < t) { 
  a = new int[i]; 
  //@ assignable j; 
  //@ accessible i,t,j,a,k; 
  //@ cost_footprint k; 
  \abstract_statement P; 
  i++; 
} 

The resulting ACRS thus accumulates cost “i” at each iteration, plus the 
memory consumed by the abstract statement: 

C(i,t,j,k) = 0                             {i ≥ t} 
C(i,t,j,k) = ac_P(k) + i + C(i',t,_,k)     {i' = i+1, i < t} 

Now, maximizing the expression C_W = i under {i' = i+1, i < t} results in 
C_W^max = t−1 and the upper bound ACI “//@ cost_invariant i * (t - 1 + ac_P(k));”. 
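
The following worked calculation indicates why the ACI of Example 8 is an upper bound rather than an exact cost. It rests on assumptions that are ours and not stated in the example: the loop starts with i = 0 as in Example 6, an array of size i consumes i memory units, and ac_P(k) is the same in every iteration because k is not assigned in the loop. Here I denotes the number of iterations executed so far.

```latex
\begin{align*}
c_I &= \sum_{m=0}^{I-1} \bigl(m + ac_P(k)\bigr)
     = \frac{I\,(I-1)}{2} + I \cdot ac_P(k), \\
cinv &= I \cdot \bigl(t - 1 + ac_P(k)\bigr)
      \qquad \text{(growth } i = I\text{)}, \\
cinv - c_I &= I \cdot \Bigl(t - 1 - \frac{I-1}{2}\Bigr) \;\ge\; 0
      \qquad \text{since } 0 \le I \le t .
\end{align*}
```

For instance, with t = 3 and I = 3 the accumulated cost is 3 + 3·ac_P(k), whereas the invariant yields 6 + 3·ac_P(k).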


Let c_L denote the abstract cost of executing a loop L (in analogy to c_P in 
Def. 2, but considering only loop L rather than the whole program P). We denote 
by c_I the portion of the cost in c_L up to the execution of iteration I. 


Proposition 1. Let L be a loop with variables x̄ satisfying Def. 4, cinv(x̄) its 
ACI, and σ_I ∈ ℤⁿ be the store after performing iteration I of L. Then the 
following holds: (1) cinv(x̄) is true on entering the loop; (2) c_I(σ_I) ≤ cinv(σ_I). 


* As our approach is based on a recurrences-based framework [39] that works for 
exponential and logarithmic expressions, the results in this section generalize to 
these expressions. However, the AE deductive verification system is not able to deal 
with them automatically at the moment, so we skip these expressions in our account. 


4.3 From Cost Invariants to Postconditions 

To handle programs with nested loops and to prove relational properties it is 
necessary to infer cost postconditions for abstract programs. For nested loops the 
cost postcondition states the abstract cost after complete execution of the inner 
loop and it is used to compute the invariant of the outer loop. For relational 
properties, the cost postconditions of two abstract programs are compared. Cost 
postconditions for concrete programs are obtained by upper bound solvers (e.g., 
COSTA [3], CoFloCo [16], AProVE [17]) that compute max_iter, an upper bound 
on the number of iterations that a loop performs. To do so, one relies on ranking 
functions. We do this as well, but generalize the computation of postconditions 
to abstract programs. The cost postcondition is obtained by substituting growth 
by max_iter in the formula of cinv(x̄) in Def. 6 as follows. 


Definition 7 (Cost Postcondition). Let L be a loop and max_iter be an upper 
bound on the number of iterations of L. Given the ACRS for L in Def. 3, we 
infer the cost postcondition for L as 


$post(\bar{x}) = C_B^{\max} + \mathit{max\_iter}(\bar{x}) \cdot \big( \sum_j ac_j(x_{j,1}, \ldots, x_{j,h_j}) + C_W^{\max} \big)$ 

and generate the annotation “//@ assert cost == post(x̄);”. 


To infer the postcondition for a complete abstract program, we take the sum 
of all cost postconditions of its top-level loops plus the cost of the non-iterative 
fragments. Fig. 1 shows the cost postconditions for our running example obtained 
by replacing the growth i of the invariant with the bound t on the loop iterations 
and requiring t > 0. The generation of inductive ACIs for nested loops uses the 
cost postcondition of inner loops to compute the invariants of the outer ones. 
The following theorem states soundness of cost postconditions: 


Theorem 2. Let L be a loop over variables x̄ satisfying Def. 4 and post(x̄) its 
cost postcondition. Let σ_L ∈ ℤⁿ be the store upon termination of L. Then 
c_L(σ_L) ≤ post(σ_L). 
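
Def. 6 and Def. 7 share one shape and differ only in the iteration count that is plugged in. The sketch below illustrates this for the loop of Example 7, assuming max_iter = t for t ≥ 0 and treating the value of ac_P(k) as an opaque non-negative number. The helper `bound` and the concrete numbers are illustrative assumptions, not part of the tool chain.

```python
def bound(count, c_b_max, c_w_max, abstract_costs):
    # Shared shape of Def. 6 and Def. 7:
    #   C_B^max + count * (sum_j ac_j(...) + C_W^max)
    return c_b_max + count * (sum(abstract_costs) + c_w_max)

def cinv(i, ac_p):
    # Example 7: growth = i, C_B^max = 1, C_W^max = 2
    return bound(i, 1, 2, [ac_p])

def post(t, ac_p):
    # Def. 7: growth replaced by max_iter, here assumed to be t (t >= 0)
    return bound(t, 1, 2, [ac_p])

t, ac_p = 5, 7   # arbitrary sample values
print([cinv(i, ac_p) for i in range(t + 1)], post(t, ac_p))
```

The invariant stays below the postcondition throughout the loop and meets it when growth reaches max_iter, which is why the postcondition of an inner loop can serve as a per-iteration cost when the invariant of an enclosing loop is computed.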


5 Experimental Evaluation 


We implemented a prototype of our approach, downloadable from 
https://tinyurl.com/qae-impl (including required libraries). The archive contains the 
benchmarks of this section and additional examples as well as build and usage 
instructions. The prototype is a command-line implementation backed by an existing 
cost analysis library for (non-abstract) Java bytecode as well as the deductive 
verification system KeY [2] including the AE framework [37,38]. Our implementation 
consists of three components: (1) an extension of a cost analyzer (written 
in Python) to handle abstract Java programs, (2) a conversion tool (written 
in Java) translating the output of the analyzer to a set of input files for KeY, and 
(3) a bash script orchestrating the whole tool chain, specifically, the interplay 
between item (1), item (2) and the two libraries. In case of a failed certification 
attempt, our script offers the choice to open the generated proof in KeY for further 
debugging. In total, our implementation (excluding the libraries) consists 
of 1,802 lines of Python, 703 lines of Java, and 389 lines of bash code (without 
blank lines and comments). 


To assess effectiveness and efficiency of our approach, we used our QAE 
implementation to analyze seven typical code optimization rules using cost models 
M_instr (rows “1*” to “6*” in Table 1) and M_heap (rows “7*”). While M_instr counts 
the number of instructions, M_heap measures heap consumption. The first column 
identifies the benchmark (“a” refers to the original program, “b” to the 
transformed one), the second column, P, gives the kind of proven cost result 
(asymptotic “a”, exact “e”, upper “u”), column three shows the inferred growth 
function for each loop in the program (separated by “,” if there are two or more 
loops), and the fourth column lists the cost postcondition obtained by the analysis 
(expressions indicating the number of loop iterations are highlighted). Columns 
five to eight display performance metrics. Time t_cost, given in milliseconds, is 
the time needed to perform the cost analysis. The proof generation time t_proof 
is given in seconds. We also display the time t_check needed for checking integrity 
of an already generated proof certificate. Finally, s_proof is the size of the 
generated KeY proof in terms of number of proof steps. Even though the time needed 
for certification is significantly higher than for cost analysis (which is to be 
expected), each analysis can be performed within one minute. The time to check 
a proof certificate amounts to approximately one fourth to one third of the time 
needed to generate it. We stress that all analyses are fully automatic. 


We briefly describe the nature of each experiment: 1 is a loop unrolling 
transformation duplicating the body of a loop: each copy of the body is put inside an 
if-statement conditioned by the loop guard. Here, we had to switch to asymptotic 
cost invariants: the cost analyzer over-approximates the number of iterations 
of the unrolled loop, since there are different possible control flows in the body. 
This was automatically detected by the certifier, which failed to find a proof when 
exact cost invariants were conjectured and succeeded with asymptotic ones. 2 is the 
CodeMotion example from Sect. 2. The result reflects the cost decrease in the 
sense that fewer instructions need to be executed by the transformed program. 3 
implements a LoopTiling optimization at compiler level in which a single loop 
with n·m iterations is transformed into two nested loops, an outer one looping 
until n and an inner one until m. Since our cost analyzer only handles linear 
size expressions, the first program is written using an auxiliary parameter t that 
is then instantiated to value n·m. 4 is a SplitLoop transformation splitting a 
loop with two independent parts into two separate loops. We prove that this 
transformation does not affect the cost up to a constant factor. 5 is an 
optimization combining two loops with the same body structure into one loop. 6 is 
an example with three loops, two of them nested and one simple. The optimization 
combines the bodies of the outer loop in the nested structure and the simple loop. 7 is 
an array optimization, where an array declaration is moved in front of a loop, 
initializing it with an auxiliary parameter that is the sum of all the initial sizes. 


         Cost analysis results
     P   Growth       Postcondition                                   t_cost  t_proof  t_check  s_proof
                                                                      [ms]    [s]      [s]      [#nodes]
1a   a   i            [?]·ac_P(z)                                     45.0    12.9     4.3      1,784
1b   a   i            t·ac_P(z)                                       53.4    23.8     5.0      3,472
2a   e   i            2 + t·(7 + ac_P(t,w) + ac_Q(t,z))               50.0    23.3     5.7      3,692
2b   e   i            3 + ac_P(t,w) + t·(6 + ac_Q(t,z))               42.0    19.7     5.7      3,243
3a   e   i            2 + t·(6 + ac_P(k))                             49.1    18.7     5.1      2,821
3b   e   i, j         [?]·(6 + ac_P(k))                               49.5    23.3     5.7      3,794
4a   e   i+1          [?]·([?] + ac_[?](t,w) + ac_[?](t,z))           49.5    23.8     5.7      3,933
4b   e   i+1, i+[?]   12 + [?]·(12 + ac_[?](t,w) + ac_[?](t,z))       48.5    29.4     7.3      5,137
5a   e   i, j         2 + n·(6 + ac_P(y)) + m·(6 + ac_P(y))           55.1    25.3     7.1      4,795
5b   e   i            2 + ([?] + m)·(8 + ac_P(y))                     48.2    14.1     4.7      2,492
6a   e   k, j, n−[?]  6 + n·(m·(6 + ac_P(y))) + n·(5 + ac_Q(y))       49.8    32.0     8.1      7,078
6b   e   k, j         7 + n·(m·(6 + ac_P(y)) + ac_Q(y))               49.6    24.9     6.4      4,995
7a   u   i−1          [?]·(t − 1 + ac_P(y))                           51.2    15.6     5.3      2,578
7b   u   i−1          [?]                                             43.3    13.0     4.2      1,793


Table 1: Results of the experiments. 


6 Related Work 


The present paper builds on the original AE framework [37,38], which we extend 
to Quantitative AE. At the moment no other approach or tool is able to analyze 
and certify the cost of schematic programs, specifically relational properties, so 
a direct comparison is impossible. 


Cost Analysis. There are many resource analysis tools, including: [20], based 
on introducing counters and inferring loop invariants; [23], based on an analysis 
over the depth of functional programs formalized by means of type systems. 
Approaches that bound the number of execution steps include [19,29], working at 
the level of compilers. Systems such as AProVE [17] analyze the complexity of 
Java programs by transforming them to integer transition systems; COSTA [3] 
and CoFloCo [16] are based on the generation of cost recurrence equations 
from which upper bounds can be inferred. That is also the basis of the approach 
we pursue to infer abstract upper bounds in Sect. 4.1; hence, our technique can be 
viewed as a generalization of these systems. Approaches based on type systems 
could also be generalized to work on abstract programs by introducing abstract 
cost as in Sect. 4.1. 

For our work it is crucial to use ranking functions to infer growth of cost 
invariants. Ranking functions were used to generate bounds on the number of 
loop iterations in several systems, but none used them to define growth: [10] 
obtain runtime complexity bounds via symbolic representation from ranking 
functions, likewise PUBS [3], Loopus [40], and ABC [8]. PUBS analyzes all 
loop transitions at once, Loopus uses an iterative procedure where bounds are 
propagated from inner to outer loops, and ABC deals with nested, but not sequential 
loops. In our work, when inferring upper bounds, we solve all transitions at once 
and handle nested as well as sequential loops. 

Certification. Several general-purpose deductive software verification [21] tools 
exist, including VeriFast [34], Why [15], Dafny [28], KIV [33], and KeY [2]. 
We use KeY, currently the only system to implement AE. Interactive proof 
assistants like Isabelle [31] or Coq [7] also support more or less expressive abstract 
program fragments, but lack full automation. There are dedicated approaches 
involving schematic programs for specific contexts, like regression verification [18], 
compilation [22,26,30] or derived symbolic execution rules [12]. 

Regarding the combination of deductive verification and cost analysis, the 
closest approach to ours is the integration of COSTA and KeY [4] which was 
realized for concrete, not abstract programs. They verify upper bounds on the 
cost of concrete programs by decomposing them into ranking functions and size 
relations which are then verified separately. Here we use the novel concept of 
cost invariant that allows verification of quantitative properties without decom- 
position. Paper [4] deals only with the global number of iterations as is common 
in worst-case cost analysis. Our cost invariants are designed to be inductive and 
propagate cost through all loop iterations. Radiček et al. [32] devise a formal 
framework for analyzing the relative cost of different programs (or the same pro- 
gram with different inputs). Compared to our approach, they target purely func- 
tional programs extended with monads representing cost, while we work with an 
industrial programming language. Moreover, we generally reason about the cost 
of transformations, not of a transformation applied to one particular program. 


7 Conclusion and Future Work 


We presented the first approach to analyze the cost of schematic programs with 
placeholders. We can infer and verify cost bounds for a potentially infinite class 
of programs once and for all. In particular, for the first time, it is possible to 
analyze and prove changes in efficiency caused by program transformations—for 
all input programs. Our approach supports exact and asymptotic cost and a 
configurable cost model. We implemented a tool chain based on a cost analyzer 
and a program verifier which analyzes and formally certifies abstract cost bounds 
in a fully automated manner. Certification is essential, because only the verifier 
can determine whether the bounds inferred by the cost analyzer are exact. 

Our work required the new concept of an (abstract) cost invariant. This is 
interesting in itself, because (i) it renders the analysis of nested loops modular 
and (ii) provides an interface to backends (such as verifiers) that characterizes 
the cost of code in iterations. 

Obvious future work involves extending the analyzed target language. Cost 
analysis and deductive verification (including AE) are already possible for a large 
Java fragment [3,37]. More interesting, and more challenging, is the analysis 
of program transformations that parallelize code. The extension to larger classes 
of cost functions, such as logarithmic or exponential, could be realized by 
integrating non-linear SMT solvers into the tool chain. 


Abstract. Modern RESTful services expose RESTful APIs to integrate 
with diversified applications. Most RESTful API parameters are weakly 
typed, which greatly increases the possible input value space. This makes 
it difficult for automated testing tools to generate effective test cases that 
reveal web service defects related to parameter validation. We call this 
phenomenon the type collapse problem. To remedy this problem, we 
introduce FET (Format-encoded Type) techniques, including the FET, the 
FET lattice, and the FET inference, to model fine-grained information for 
API parameters. Enhanced by FET techniques, automated testing tools 
can generate targeted test cases. We demonstrate Leif, a trace-driven 
fuzzing tool, as a proof-of-concept implementation of FET techniques. 
Experiment results on 27 commercial services show that FET inference 
precisely captures documented parameter definitions, which helps Leif to 
discover 11 new bugs and reduce 72% ~ 86% fuzzing time as compared 
to state-of-the-art fuzzers. 


Keywords: Fuzz Testing · RESTful Web Service · Type Inference. 


1 Introduction 


The REST (Representational State Transfer) architecture [28] now 
dominates the design of complex web services, such as public clouds (e.g., AWS 
and Azure), social networking (e.g., Facebook and Twitter), and code hosting 
(e.g., GitHub and GitLab). Typically, a RESTful web service exposes a set of 
RESTful APIs. A client requests an API providing parameter values, and the 
service responds with data represented in some common exchange format (e.g., 
JSON or XML). According to a recent survey of 40 real-world popular RESTful 
web services [36], modern services involve an average of 64 APIs and over 20 
parameters per API. Testing such an input space of possible parameter value 
combinations is challenging, and therefore automated testing is indispensable. 

Since RESTful APIs are intended for applications implemented in different 
programming languages, API parameters are weakly typed. An investigation 
of 27 RESTful web services [19] shows that over 67% of the parameters are 
string-typed, about 32% are number-typed, and the remaining 1% are boolean-typed 
or object-typed. Overusing primitive data types significantly increases 
the possible input value space. For example, a string-typed parameter can 
take values varying from a specific URL to a comment about a YouTube video. 
This poses difficulties for generating effective test cases. Consequently, many 
automated REST testing tools are ineffective while RESTful web services suffer 
from various input-related attacks, such as integer overflow attacks and SQL 
injection attacks [18]. We call this phenomenon the type collapse problem. 


The solution is to give automated testing tools a better 
understanding of parameters. We observe that although parameter types are weak, 
their values usually have distinct formats. For example, a datetime parameter 
may require an ISO 8601 date string. This motivates us to introduce the FET 
(Format-encoded Type), which combines data types and value formats to describe 
parameters at a fine granularity. For instance, the SHA1 FET represents 40-digit-hex 
string-typed parameters. Furthermore, we introduce the FET lattice, which 
hierarchically organizes a set of FETs by a partial order, along with the FET 
inference, which seeks suitable FETs among a FET lattice for parameters in an 
unambiguous manner. 


To demonstrate how FET techniques enhance automated REST testing, we 
implement Leif, a trace-driven fuzz testing tool. Leif gains fine-grained parameter 
information by performing FET inference on HTTP traffic and then mutates 
parameter values to mimic real attacks based on the inferred results. We apply 
Leif to real-world web services, and the experiment results are encouraging. FET 
techniques provide better bug-finding capability and bring 72% ~ 86% fuzzing 
time reduction for Leif when compared to state-of-the-art fuzzing tools. 


In particular, this paper makes the following contributions: 


— We introduce FET techniques, including the FET, the FET lattice, and the 
FET inference, to remedy the type collapse problem and serve as a cornerstone 
for high-level automated testing tools. 


— We implement Leif, a FET-enhanced fuzzing tool which showcases how to 
construct a ubiquitous FET lattice for common RESTful APIs and embed 
FET techniques in an existing testing workflow. 


— We evaluate the accuracy of FET inference, and the result is encouraging 
(67% exact matches, 32% partial matches, and 1% mismatches on average). 

— We evaluate Leif’s bug-finding capability (11 distinct bugs detected in 27 
commercial web services) as well as its testing efficiency (72% ~ 86% fuzzing 
time reduction as compared to existing fuzzing tools). 


The remainder of the paper is organized as follows. Section 2 analyzes the type 
collapse problem in detail. Section 3 introduces FET techniques to solve the type 
collapse problem. Section 4 introduces Leif as a proof-of-concept implementation 
of FET techniques. Section 5 presents the evaluation of FET techniques and Leif. 
Section 6 discusses related work and Section 7 concludes. 


2 Motivation 


It is essential for automated REST testing tools to generate test cases by filling 
parameters with automatically generated values. This procedure requires ade- 
quate information about parameters. Otherwise, the possible candidate space 
would become enormous even for one single parameter. Therefore, a majority of 
state-of-the-art automated testing tools focus on reducing the candidate space 
by sophisticated methodologies. For instance, RESTler [13] arranges multiple 
APIs in the producer-consumer order, and uses response data gained from the 
previous APIs to request the next. Chizpurfle [23] and EvoMaster [12] generate 
optimal candidate values based on evolutionary algorithms. 

Nevertheless, the previous works have not focused on the root cause of the 
candidate space explosion. Since most RESTful APIs are designed for exchang- 
ing data between programs implemented by different languages (e.g., Java for 
mobile applications while Python for the service), only a few common primitive 
data types can be used to represent API parameters. For example, Amazon’s 
online shopping web service takes about 2,400 parameters, among which 748 
are number-typed (31%) and 1,581 are string-typed (66%) [19]. That is, types, 
which are supposed to be diversified, now collapse into very limited cases. Conse- 
quently, existing automated testing tools encounter a huge candidate space, e.g., 
solely knowing a parameter is string-typed spans a boundless candidate space 
from paragraphs of Shakespeare to specific datetime strings. In addition, it is 
difficult to pick effective values that can pass parameter checking, then reach 
actual business logic, and finally trigger bugs. Figure 1 shows a code sample of 
a RESTful API that requires four parameters: string-typed start, string-typed 
end, number-typed amount, and number-typed interest. In order to generate 
an effective value for the parameter start that can reach the business logic, a 
testing tool has to know that it is an ISO 8601 datetime string. Unfortunately, since 
parameters are mainly of primitive data types, this information is usually hard 
to obtain. Therefore, the testing tool may treat it as an ordinary string and 
generate arbitrary strings, all of which are rejected by the parameter checking and 
are thus essentially useless. 


def calculate_monthly_installment():
    try:
        # Reject the request unless every parameter has the expected format.
        start = parse(request.get("start"), "YYYY-MM-DDTHH:MM:SSZ")
        end = parse(request.get("end"), "YYYY-MM-DDTHH:MM:SSZ")
        amount = float(request.get("amount"))
        interest = float(request.get("interest"))
    except Exception:
        return make_response("Invalid Parameter", 400, "Bad Request")
    # business logic


Fig. 1. A Code Sample of a RESTful API (Written in Python). 

The type collapse problem is the major obstacle to obtaining adequate 
parameter information and leads to inefficient automated testing. Therefore, our 
solution is to provide a fine-grained description method for parameters by 
exploiting both their data types and their value formats. Leveraging such information, 
we are able to bootstrap and enhance automated testing techniques to gain 
efficiency improvement when testing RESTful web services. 


3 FET Techniques 


To address the type collapse problem, we introduce FET techniques, including 
the FET (Format-encoded Type), the FET lattice, and the FET inference. A 
FET models an API parameter by its data type and its value format. A FET 
lattice hierarchically organizes a set of FETs based on a partial order. We design 
FET inference algorithms to seek suitable FETs among a FET lattice for pa- 
rameters, and the inferred results are the critical information for bootstrapping 
test case generation strategies. 


3.1 Type Lattice 


The idea of the FET lattice is inspired by the type lattice [24] for programming 
languages, widely used in compilation and program analysis [33,44,45]. A type 
lattice is a complete lattice defined on ⟨T, ⊑⟩, where T is a set of data types (e.g., 
long in C/C++) and ⊑ is a partial order representing type convertibility. Every 
two lattice elements have a unique least upper bound and a unique greatest lower 
bound. An element t_j is said to cover another element t_i if and only if t_i ⊏ t_j 
but there does not exist a t_m such that t_i ⊏ t_m ⊏ t_j, where t_i ⊏ t_j means 
t_i ⊑ t_j and t_i ≠ t_j. Type lattices can model class inheritance hierarchies for 
object-oriented languages. In this context, for any two elements t_i and t_j, t_i ⊑ t_j 
holds if and only if t_i inherits from or equals t_j. Figure 2 depicts a type lattice 
for java.util.Collection (each vertex represents a class or an interface, and 
each directed edge stands for the inheritance relationship). 

The type lattice is the cornerstone of type systems for modern programming 
languages. In static compilation, the type lattice is applied to checking value 
assignment and type casting for code validity [38]. In dynamic compilation, e.g., 
JIT (Just-in-time Compilation) [14], it is employed to predict variable types at 
program points, so as to remove unnecessary type checking. The type lattice is a 
powerful tool to ensure the correctness and efficiency of programs. However, in 
the context of REST, API parameters only manifest limited primitive data types 
due to the type collapse problem, where the type lattice is no longer sufficient. 


Fig. 2. A Type Lattice for the Java Collections Framework. 


3.2 FET Lattice 


A FET lattice is defined on ⟨Ψ ⊆ T × F, ≤⟩. A FET ψ ∈ Ψ is defined by ⟨t_ψ, f_ψ⟩, 
where t_ψ ∈ T is a data type, and f_ψ ∈ F is a value format or, more specifically, 
a set of values. ≤ is a partial order such that, for any two FETs ψ_i and ψ_j, ψ_i ≤ ψ_j 
holds if and only if t_{ψ_i} is type-convertible to t_{ψ_j} and f_{ψ_i} is a subset of f_{ψ_j}, 
denoted by t_{ψ_i} ⊑ t_{ψ_j} and f_{ψ_i} ⊆ f_{ψ_j}. A FET ψ_i covered by ψ_j implies that ψ_i 
describes parameter features in a finer grain than ψ_j. ψ_⊤ and ψ_⊥ are defined 
as ⟨AnyType, U⟩ and ⟨NoType, ∅⟩, where U is the set containing arbitrary values. 
Figure 3 depicts an example FET lattice (a FET’s name describes its value 
format, and FETs at the same level are identically colored). 


FET Acceptance for Parameter Values. Similar to type lattices, FET 
lattices help to determine FETs for given parameter values. To achieve this, we 
define that a value v is accepted by a FET ψ if and only if typeof(v) ⊑ t_ψ and 
v ∈ f_ψ, denoted by ψ ∈ acceptance(v). Otherwise v is said to be rejected by 
ψ, denoted by ψ ∉ acceptance(v). Trivially, ψ_⊤ accepts all values while 
ψ_⊥ accepts none. A value v can be accepted by more than one FET, while the 
greatest lower bound of the acceptances describes the value in the finest grain. 
We call such an acceptance the minimum acceptance of v. The predecessors 
of the minimum acceptance accept v but describe it in a coarser grain, while 
the siblings reject v but describe other similar values in the same grain. The 
minimum acceptance of v, its predecessors, and its siblings compose a tree, 
denoted by ψ-tree(v). For example, for a SHA1 string v, its minimum acceptance 
(the SHA1 FET in Figure 3), the predecessors (Hash, String, and ψ_⊤) and the 
siblings (MD5 and SHA256) compose the ψ-tree(v). 
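
The acceptance relation can be sketched in a few lines. The lattice fragment below is hypothetical: it borrows the FET names mentioned in the text (String, Hash, MD5, SHA1, SHA256), adds a Number FET for contrast, uses illustrative regular expressions, and omits ψ_⊥, which accepts nothing. The function names are ours, not Leif's.

```python
import re

# Hypothetical fragment of a FET lattice: name -> (parent, data type, format).
# The regular expressions are illustrative only.
FETS = {
    "Top":    (None,     "any",    r"(?s).*"),
    "String": ("Top",    "string", r"(?s).*"),
    "Number": ("Top",    "number", r"-?\d+(\.\d+)?"),
    "Hash":   ("String", "string", r"[0-9a-f]+"),
    "MD5":    ("Hash",   "string", r"[0-9a-f]{32}"),
    "SHA1":   ("Hash",   "string", r"[0-9a-f]{40}"),
    "SHA256": ("Hash",   "string", r"[0-9a-f]{64}"),
}

def type_of(value):
    return "number" if isinstance(value, (int, float)) else "string"

def accepts(name, value):
    # v is accepted iff typeof(v) is convertible to t_psi and v lies in f_psi.
    _, dtype, fmt = FETS[name]
    convertible = dtype in ("any", type_of(value))
    return convertible and re.fullmatch(fmt, str(value)) is not None

def acceptance(value):
    # All FETs of the lattice fragment that accept the value.
    return {name for name in FETS if accepts(name, value)}

def depth(name):
    # Distance from Top, following parent links.
    d = 0
    while FETS[name][0] is not None:
        name, d = FETS[name][0], d + 1
    return d

def minimum_acceptance(value):
    # In an unambiguous lattice the accepting FETs form a chain from Top,
    # so the deepest one is their greatest lower bound.
    return max(acceptance(value), key=depth)

sha1_value = "a" * 40
print(sorted(acceptance(sha1_value)), minimum_acceptance(sha1_value))
```

A 40-digit hexadecimal string is accepted by Top, String, Hash, and SHA1, and SHA1 is its minimum acceptance, mirroring the example in the text.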


Avoiding the Ambiguity of FET Lattices. As seen in Figure 3, if a single 
value is accepted by two sibling FETs (e.g., MD5 and SHA1), the minimum 
acceptance will fall into the trivial ψ_⊥. Generally, a FET lattice is said to be 
ambiguous if there exist two FETs with the same predecessor that can both accept 
the same value. To avoid ambiguity, a validation procedure is obligatory after 
a FET lattice is constructed, which ensures that the value formats of every two 
sibling FETs with the same data type are always disjoint. 

Fig. 3. An Example FET Lattice. 


In practice, we specify value formats by regular languages, and provide 
a ubiquitous FET lattice [20] to model the most common RESTful parameters. 
We will elaborate on FET lattice construction and verification in Section 4.2. 


3.3 FET Inference 


Tree-merging FET Inference. As discussed previously, for a single value 
v, a unique ψ-tree(v) can always be found in an unambiguous FET lattice. A 
RESTful API parameter usually involves multiple values in practice. Hence we 
give the tree-merging FET inference. For a parameter with values v_1, ..., v_n, 
the tree-merging inference computes ψ-tree(v_1), ..., ψ-tree(v_n) and then 
merges them into one tree. The merged tree is denoted by ψ-tree^n(V_n), where 
V_n = {v_1, ..., v_n}. The tree-merging inference can be described as a 
“find-expand-merge” procedure: (1) find the minimum acceptance for a single 
value v_i by performing a depth-first search from ψ⊤ and add the predecessors 
along the search path into the tree; (2) expand the tree by adding the siblings, 
which yields ψ-tree(v_i); (3) repeat steps (1) and (2) for every value and merge 
all the trees. Steps (1) and (2) are illustrated in Figure 4, and step (3) can be 
reduced to DNS tree merging [25]. Assuming that the FET lattice has l levels 
with m FETs, the time complexity is O(m) for computing one tree and O(l) for 
merging two trees. Thus the time complexity of tree-merging FET inference for 
a parameter involving n values is O(n · (m + l)). 
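
The find-expand-merge procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Leif's implementation: the lattice below is a hypothetical tree-shaped toy (not the ubiquitous lattice of [20]), the regular expressions and the names `CHILDREN`, `FETS`, `accepts`, `psi_tree`, and `merged_tree` are invented for the example, ψ⊥ is left implicit, and a plain set union stands in for the tree merging of step (3).

```python
import re

# Hypothetical toy lattice (not the paper's 21-FET lattice): parent -> children.
CHILDREN = {
    "TOP": ["Number", "String"],
    "String": ["Hash", "Date"],
    "Hash": ["MD5", "SHA1", "SHA256"],
}
# Each FET pairs a data type with an optional value format (regular expression).
FETS = {
    "TOP": (object, None),
    "Number": ((int, float), None),
    "String": (str, None),
    "Hash": (str, r"[0-9a-f]+"),
    "Date": (str, r"\d{4}-\d{2}-\d{2}"),
    "MD5": (str, r"[0-9a-f]{32}"),
    "SHA1": (str, r"[0-9a-f]{40}"),
    "SHA256": (str, r"[0-9a-f]{64}"),
}


def accepts(fet, value):
    """A FET accepts a value if the data type and the value format both match."""
    data_type, pattern = FETS[fet]
    if not isinstance(value, data_type):
        return False
    return pattern is None or re.fullmatch(pattern, value) is not None


def psi_tree(value):
    """Steps (1) and (2): find the minimum acceptance, then add its siblings."""
    path = ["TOP"]
    while True:
        # Descend from the top; at most one child accepts in an unambiguous lattice.
        nxt = next((c for c in CHILDREN.get(path[-1], []) if accepts(c, value)), None)
        if nxt is None:
            break
        path.append(nxt)
    siblings = CHILDREN[path[-2]] if len(path) > 1 else []
    return set(path) | set(siblings)


def merged_tree(values):
    """Step (3): merge the trees of all values (set union stands in for merging)."""
    result = set()
    for value in values:
        result |= psi_tree(value)
    return result


sha1_value = "a" * 40
print(sorted(psi_tree(sha1_value)))
print(sorted(merged_tree([sha1_value, "2020-01-09"])))
```

For the 40-character hexadecimal value, the tree contains the minimum acceptance SHA1, its predecessors Hash, String, and the top FET, and its siblings MD5 and SHA256, which mirrors the SHA1 example of Section 3.2.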

Bitfield-boosting FET Inference. In practice, we notice that the number 
of FETs m in a lattice is a constant while the number of values n is a variable 
(usually over 1,000). Therefore, we optimize the tree-merging FET inference 
based on three observations: (1) each FET can be uniquely represented by one 
bit in an m-bit bitfield, and therefore ψ-trees can be represented by several bits 
in such bitfields; (2) given a minimum acceptance, its ψ-tree can be uniquely 
determined, so the ψ-tree for every FET can be computed before inference; (3) 
merging two ψ-trees is equivalent to performing a bitwise OR operation on their 
corresponding bitfields. 

Fig. 4. Inferring ψ-tree(v_i) for a Single Value v_i. 

Hence, we give the forward computation algorithm and the bitfield-boosting 
FET inference. The forward computation traverses the lattice in breadth-first 
order, assigns a unique bitfield ID per FET, and computes the ψ-tree, as shown 
in Algorithm 1. Leveraging the forward computation, the bitfield-boosting 
inference only needs to find the minimum acceptance by depth-first search, 
yield its bitfield tree, and merge it into ψ-tree^{i-1}(V_{i-1}), as shown 
in Algorithm 2. Therefore, ψ-tree^n(V_n) can be efficiently computed by a 
series of bitwise OR operations instead of graph computations, reducing the time 
complexity from O(n · (m + l)) to O(n · m). 
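
As an illustration, the forward computation and the bitfield-boosting inference can be sketched in Python, where arbitrary-precision integers serve as bitfields. The sketch assumes a hypothetical tree-shaped toy lattice with invented regular expressions, leaves ψ⊥ implicit, and adds an early `break` that relies on sibling FETs being disjoint; the names `CHILDREN`, `FETS`, `accepts`, `forward_computation`, and `infer` are illustrative and not taken from Leif.

```python
import re
from collections import deque

# Hypothetical toy lattice: parent -> children, and per-FET (data type, format).
CHILDREN = {
    "TOP": ["Number", "String"],
    "String": ["Hash", "Date"],
    "Hash": ["MD5", "SHA1", "SHA256"],
}
FETS = {
    "TOP": (object, None),
    "Number": ((int, float), None),
    "String": (str, None),
    "Hash": (str, r"[0-9a-f]+"),
    "Date": (str, r"\d{4}-\d{2}-\d{2}"),
    "MD5": (str, r"[0-9a-f]{32}"),
    "SHA1": (str, r"[0-9a-f]{40}"),
    "SHA256": (str, r"[0-9a-f]{64}"),
}


def accepts(fet, value):
    """A FET accepts a value if the data type and the value format both match."""
    data_type, pattern = FETS[fet]
    if not isinstance(value, data_type):
        return False
    return pattern is None or re.fullmatch(pattern, value) is not None


def forward_computation(top="TOP"):
    """Algorithm 1 sketch: one bit per FET (breadth-first), then one tree per FET."""
    ids, next_id, queue = {}, 1, deque([top])
    while queue:
        current = queue.popleft()
        ids[current] = next_id
        next_id <<= 1
        queue.extend(CHILDREN.get(current, []))
    p_tree, tree = {top: 0}, {top: ids[top]}
    queue = deque([top])
    while queue:
        current = queue.popleft()
        children = CHILDREN.get(current, [])
        s_tree = 0
        for child in children:  # bits of all FETs on this sibling level
            s_tree |= ids[child]
        for child in children:
            p_tree[child] = p_tree[current] | ids[current]  # bits of predecessors
            tree[child] = p_tree[child] | s_tree
            queue.append(child)
    return ids, tree


def infer(values, tree, top="TOP"):
    """Algorithm 2 sketch: find each minimum acceptance, OR its precomputed tree."""
    merged = 0
    for value in values:
        current, accepted = top, True
        while accepted:
            accepted = False
            for child in CHILDREN.get(current, []):
                if accepts(child, value):
                    current, accepted = child, True
                    break  # siblings are disjoint, so no other child can accept
        merged |= tree[current]
    return merged


ids, tree = forward_computation()
bits = infer(["a" * 40, "2020-01-09"], tree)
print(sorted(name for name, bit in ids.items() if bits & bit))
```

Because every tree is precomputed once, the per-value work during inference reduces to the descent through the lattice plus a single OR, which is the source of the O(n · m) bound.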


4 FET-enhanced REST Fuzzing 


To demonstrate the utility of FET techniques, we design Leif, a FET-enhanced 
REST fuzzing tool, and implement it as a command-line tool in 2,796 lines of 
Python code. This section elaborates the workflow of Leif, along with 
methodologies for collecting HTTP traffic (Section 4.1), for constructing FET 
lattices (Section 4.2), and for interfacing FET techniques with fuzzers 
(Section 4.3). 

Figure 5 depicts Leif’s workflow and its interaction with existing systems 
and tools. Leif assumes that the web service under test is already deployed 
on a staging server or in a production environment. The developer acquires 
the Leif program with a built-in FET lattice and traces HTTP traffic between 
the service and the clients. Then Leif identifies RESTful APIs by parsing the 
captured traffic and performs FET inference on parameter values. The inferred 
results are provided to bootstrap test case generation. Finally, Leif emits test 
cases and observes erroneous behaviors of the service. 

Algorithm 1: The Forward Computation. 
Input: A FET Lattice. 

ID ← 1; queue ← Queue(ψ⊤); 
while !queue.isEmpty() do 
    current ← queue.pop(); 
    current.ID ← ID; 
    ID ← ID << 1; 
    foreach ψ ≺ current AND ψ ≠ ψ⊥ do 
        queue.push(ψ); 
ψ⊤.pTree ← 0; ψ⊤.sTree ← ψ⊤.ID; 
ψ⊤.tree ← ψ⊤.pTree ∨ ψ⊤.sTree; 
queue ← Queue(ψ⊤); 
while !queue.isEmpty() do 
    current ← queue.pop(); 
    sTree ← 0; 
    foreach ψ ≺ current AND ψ ≠ ψ⊥ do 
        sTree ← sTree ∨ ψ.ID; 
    foreach ψ ≺ current AND ψ ≠ ψ⊥ do 
        ψ.pTree ← current.pTree ∨ current.ID; 
        ψ.sTree ← sTree; 
        ψ.tree ← ψ.pTree ∨ ψ.sTree; 
        queue.push(ψ); 


4.1 Collecting and Parsing HTTP Traffic 


As introduced in Section 3.3, the inferred result of a parameter is contributed by 
its different values, and therefore the accuracy of FET inference increases when 
Leif witnesses more value cases. Thus developers are expected to apply suitable 
tracing methods. For example, monkey testing and scripted regression testing 
are preferable to unit testing for collecting traffic. Leif takes the HAR file (an 
archival format for HTTP traffic [39]), which is the standard output of network 
proxies (Fiddler, MitmProxy [22], etc.) and browser inspection tools (e.g. Chrome 
and Safari). To identify parameters, the payload (including the headers, the 
query string, and the body) of a captured request is parsed into key-value pairs 
in JSON format. Due to the type collapse problem, only four data types are 
present: boolean, number, string and object (including array). Non-object-typed 
parameters are directly provided to FET inference while object-typed 
parameters are flattened. Since a JSON object is a tree of properties, Leif flattens 
it by splitting leaf properties into independent non-object-typed parameters and 
assigning new keys named by their JSONPaths [29], as illustrated in Figure 6. 
Then the flattened parameters are also provided to FET inference. Finally, FET 
inference receives parameters for each API where each parameter has a unique 
key and usually multiple values. 

Algorithm 2: The Bitfield-boosting FET Inference. 
Input: Parameter Values V_n = {v_1, ..., v_n}. 
Output: ψ-tree^n(V_n). 

ψ-tree^0(V_0) ← 0; 
for i ← 1 to n do 
    current ← ψ⊤; 
    accepted ← true; 
    while accepted do 
        accepted ← false; 
        foreach ψ ≺ current do 
            if ψ ∈ acceptance(v_i) then 
                current ← ψ; 
                accepted ← true; 
    ψ-tree^i(V_i) ← ψ-tree^{i-1}(V_{i-1}) ∨ current.tree; 
return ψ-tree^n(V_n); 
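
The object flattening of Section 4.1 can be sketched as a short recursive generator. This is an illustrative sketch rather than Leif's code: the function name `flatten` is invented, the key `category` in the sample payload is a hypothetical stand-in for the key used in Figure 6, and the bracketed index notation for arrays is an assumption, since the text only states that arrays are treated as objects.

```python
def flatten(value, path="$"):
    """Yield (JSONPath, leaf value) pairs for every non-object leaf property."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        # Assumption: array elements are addressed by index, as in JSONPath.
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value


# Hypothetical payload modeled on the book example of Figure 6.
payload = {
    "title": "A Brief History of Time",
    "price": 45.00,
    "category": {"main": "Science", "sub": {"main": "Cosmology"}},
}
for key, leaf in flatten(payload):
    print(key, type(leaf).__name__, leaf)
```

Each resulting pair is an independent non-object-typed parameter with a unique key, so it can be passed to FET inference in the same way as a top-level parameter.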


4.2 Ubiquitous FET Lattice 


Regular Expressions for Value Formats. In Leif’s built-in ubiquitous FET 
lattice, value formats are specified by regular expressions. We choose regular 
expressions rather than creating a new language to define value formats 
because they have several advantages in this scenario. Firstly, regular 
expressions are the de-facto descriptions of most string formats. Although 
regular expressions are context-free, they can still distinguish different value 
formats. Secondly, they are already familiar to developers, and therefore they 
are easy to construct without extra learning costs. Finally, ensuring the 
unambiguity of a FET lattice amounts to ensuring that the regular expressions 
of sibling FETs are orthogonal, which can be formally determined by finite 
automata [46]. 
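
A lightweight way to spot ambiguity early is to look for sample values that two sibling formats both accept. The sketch below is only a sample-based sanity check under stated assumptions: it can exhibit a witness of ambiguity but cannot prove orthogonality, which requires the automata-based verification described in this section. The function name `ambiguous_pairs`, the sibling formats, and the samples are all hypothetical.

```python
import re
from itertools import combinations


def ambiguous_pairs(sibling_formats, samples):
    """Return (fet_a, fet_b, sample) triples where two siblings accept one sample."""
    hits = []
    for (name_a, regex_a), (name_b, regex_b) in combinations(sibling_formats.items(), 2):
        for sample in samples:
            if re.fullmatch(regex_a, sample) and re.fullmatch(regex_b, sample):
                hits.append((name_a, name_b, sample))
    return hits


# Hypothetical sibling formats; "Hex" deliberately overlaps with the other two.
siblings = {
    "MD5": r"[0-9a-f]{32}",
    "SHA1": r"[0-9a-f]{40}",
    "Hex": r"[0-9a-f]+",
}
for fet_a, fet_b, sample in ambiguous_pairs(siblings, ["a" * 32, "b" * 40]):
    print(f"{fet_a} and {fet_b} both accept a value of length {len(sample)}")
```

An empty result only means that no witness was found among the samples, so such a check complements rather than replaces the formal verification.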

FET Lattice Constructing and Updating. We construct the ubiquitous 
FET lattice by referencing popular RESTful services (e.g. Google Maps, AWS, 
Twitter, and GitHub): (1) we crawl API documents from these services and 
then identify potential FETs used in these services; (2) we construct regular 
expressions for these FETs by referencing related RFCs (e.g. RFC3339 [35] for 
ISO8601, and RFC3986 [16] for URI), programming language specifications (e.g. 
the Java specification [34] for PackageName), and database schema definitions 
(e.g. the MongoDB data type definition [21] for Hash) to build a base FET 
lattice; (3) we apply the Bayesian regular expression generation technique [42] 
to discover new FETs from traffic and merge them into the base lattice; (4) we 
verify the unambiguity by checking the orthogonality of regular expressions for 
sibling FETs, using the dk.brics.automaton library [37]. The verified lattice has 
21 FETs organized in 5 levels, and we believe it is sufficient to model most 
RESTful services. If a developer has application-specific FETs (at the first 
usage or when major service updates take place), they can update the lattice by 
adding FETs via step (3) and repeating step (4) for unambiguity verification. 

[Figure 6: (a) The Original Parameter. (b) The Tree Structure. (c) The Flattening Result. The example object holds a title ("A Brief History of Time"), a price (45.00), and a nested object with the properties main ("Science") and sub.main ("Cosmology"); each leaf becomes a string-typed or number-typed parameter keyed by its JSONPath, e.g. $.title and $.price.] 

Fig. 6. An Example of Object Flattening. 


Twinning FET Inference. We notice that some parameters can be represented 
by multiple data types and are minimally accepted by distinct FETs in different 
data types. For example, an epoch datetime (elapsed seconds or milliseconds 
since 1970-01-01 00:00:00) is accepted by the EpochString FET when it is 
represented as a string, but by the Integer FET when it is represented as a 
number. Applying type casting to such parameters is therefore valuable during 
testing. To support this feature, we implement the twinning FET inference. 
Before a value is inferred, Leif generates its twinning value if possible. If the 
original value is number-typed, Leif generates a twinning string-typed value 
(e.g. 1589809244481 → "1589809244481") and vice versa ("1589809244481" 
→ 1589809244481). Then both values are inferred, and the resulting two 
ψ-trees are merged as if Leif had witnessed two independent values. By doing 
so, both the Datetime and the Integer FETs are included in the final ψ-tree^n 
of an epoch datetime parameter. 
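
Twin generation can be sketched as a small casting helper. The sketch is an assumption-laden illustration: the names `twin` and `values_to_infer` are invented, and the rule that only plain decimal strings receive a numeric twin is our simplification, since the text does not specify which strings Leif considers castable.

```python
import re


def twin(value):
    """Return the value recast to the other data type, or None if no twin exists."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Assumption: only plain decimal notations are treated as castable.
        if re.fullmatch(r"-?[0-9]+", value):
            return int(value)
        if re.fullmatch(r"-?[0-9]+\.[0-9]+", value):
            return float(value)
    return None


def values_to_infer(value):
    """The original value plus its twin, treated as two independent values."""
    other = twin(value)
    return [value] if other is None else [value, other]


print(values_to_infer(1589809244481))
print(values_to_infer("1589809244481"))
print(values_to_infer("not-a-number"))
```

Feeding both returned values to inference and merging their trees has the same effect as witnessing the parameter once as a number and once as a string.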


4.3 FET-aware Trace-driven Fuzzing 


Trace-driven fuzzing tools generate test cases by replacing parameter values 
of captured requests with candidate values. Therefore the success of a fuzzer 
mainly depends on the quality of its candidate values. In conventional tools, 
using a larger candidate dictionary is the basic strategy to increase the 
opportunity for triggering bugs, yet it lengthens the fuzzing time. 

In contrast, Leif provides a small but targeted dictionary for each FET. 
We give several examples (corresponding to Figure 3): Number is tried with 
integer overflows (8-bit, 16-bit, 32-bit, and 64-bit overflows) with signed and 
unsigned values; Datetime is tried with year overflows (year 2038 and year 
10,000), invalid dates (e.g. 2019-2-29), and timezone tweaks; ISO8601 is tried 
with omitted meta characters ("-", ":", etc.); URI is tried with malformed URLs 
(e.g. doubled "/", stripped "protocol://", and unescaped characters). With 
each parameter tagged by a ψ-tree^n, Leif generates test cases by exhausting 
the dictionaries of all the FETs in the tree. Notice that, as discussed in 
Section 3.2, the predecessors and the siblings of the minimum acceptance 
describe similar but usually invalid values. Therefore, candidates from these 
FETs are the values most likely to pass parameter checking and trigger bugs. 
For an API with multiple parameters, Leif exhausts the dictionaries for one 
parameter at a time and tests the API by iterating this exhaustion over its 
parameters. In this way, Leif increases the opportunity to trigger bugs while 
saving fuzzing time. 
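
The one-parameter-at-a-time exhaustion can be sketched as a generator over a captured request. Everything concrete below is hypothetical: the tiny `DICTIONARIES`, the parameter names, and the inferred trees are invented for illustration and are far smaller than what a real fuzzing dictionary would contain.

```python
# Hypothetical per-FET candidate dictionaries, loosely following the examples above.
DICTIONARIES = {
    "Number": [2**31, -(2**31) - 1, 2**63, 2**64],
    "Datetime": [
        "2038-01-19T03:14:08Z",   # first second past the signed 32-bit epoch range
        "10000-01-01T00:00:00Z",  # five-digit year
        "2019-02-29T00:00:00Z",   # invalid date: 2019 is not a leap year
    ],
}


def generate_test_cases(request_params, param_trees):
    """Yield (key, FET, mutated parameters), changing one parameter at a time."""
    for key, fets in param_trees.items():
        for fet in sorted(fets):
            for candidate in DICTIONARIES.get(fet, []):
                mutated = dict(request_params)  # keep the captured request intact
                mutated[key] = candidate
                yield key, fet, mutated


# Hypothetical captured request and inferred trees (as sets of FET names).
captured = {"since": "2020-01-09T00:00:00Z", "page": 1}
trees = {"since": {"Datetime", "String"}, "page": {"Number"}}
for key, fet, params in generate_test_cases(captured, trees):
    print(key, fet, params)
```

The number of test cases grows with the sum of the dictionary sizes over the parameters rather than with their product, which is consistent with the reduced fuzzing time reported in Section 5.3.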


5 Evaluation 


In this section, we evaluate Leif with real-world RESTful web services, and the 
complete dataset of our evaluation is publicly available [19]. Specifically, we 
design three experiments to answer the following research questions: 


RQ-1 How accurately do FET inference results describe RESTful API param- 
eters of complicated real-world web services? 

RQ-2 Can Leif generate effective test cases and therefore help developers to 
detect web service vulnerabilities in practice? 

RQ-3 Does Leif have better bug-finding capability with reduced fuzzing time 
when compared to existing state-of-the-art trace-driven and specification- 
driven fuzz testing tools? 


5.1 FET Inference Accuracy Evaluation 


In this experiment, we assume that API documents provided by the service 
developers are the ground truth and we validate the accuracy of FET inference 
by comparing the inferred results with the ground truth. We choose GitHub³ 
and Twitter⁴, and we randomly pick 50 RESTful APIs (25 from each). We 
extract two pieces of information from the document text: (1) parameter data 
types, as explicitly listed in the documents; (2) parameter value formats, as 
provided in the detailed descriptions (e.g. “This [the parameter since] is a 
timestamp in ISO8601 format.”⁵). We feed example requests gained from the 
documents to FET inference, compare the inferred FETs with the ground truth, 
and observe three levels of matching: 


(1) exact match, the inferred FET is said to be an exact match if it has 
exactly the same data type and value format as the ground truth; 

(2) partial match, the inferred FET is said to be a partial match if it has 
the same data type, but its value format is a proper superset of the ground 
truth; 

(3) mismatch, for the remaining cases. 


Intuitively, an exact match precisely describes a parameter such that a fuzzer 
can exploit it to generate the most targeted values. A partial match is benign: 
it includes values that will not appear in practice, so a fuzzer may generate 
a small set of useless values based on it. A mismatch indicates that the value 
format is not yet supported by the current FET lattice. 

Fig. 7. FET Inference Accuracy Evaluation Results. 


Figure 7(a) exhibits the ratios of matching on GitHub (137 parameters), 
Twitter (86 parameters) and the weighted average (223 parameters). In total, 
149 (67%) inferred results are exact matches, and 71 (32%) are partial matches. 
We observe 3 mismatches in two cases: one is a binary-array parameter 
for file uploading and the other is an array of key-value pairs (e.g. [["key1", 
"value1"], ["key2", "value2"], ...]). Binary arrays can be supported by 
adding a FET ([01]* for the value format) to the current lattice, but Leif aims 
to detect logic-related bugs while binaries are usually logic-free but 
content-sensitive [43]. Therefore Leif simply does not mutate them. As for 
key-value pairs, they are actually two-dimensional arrays where the first 
dimension is immutable since it indicates the actual parameter key. We consider 
allowing developers to specify which special parameters are immutable in a 
future version of Leif to support such cases. For the partial matches, we review 
the documents, and the top cases are application-specific formats such as 
comma-separated strings and PGP signatures. These formats are less common 
and developers can add application-specific FETs to their lattices by following 
the steps introduced in Section 4.2. Figure 7(b) exhibits the breakdown of exact 
matches (the inner ring is the distribution of the primitive data types and the 
outer ring is the inferred FETs) to quantify how FET inference improves 
parameter information. The coarse-grained number-typed (27%) and string-typed 
(61%) parameters are divided into much smaller slices (5% ~ 14%). The 
breakdown shows that FET inference classifies parameters in a balanced way 
and thereby restores the collapsed types. This enables a fuzzer to generate more 
targeted values, which shrinks the candidate space and increases the opportunity 
to find bugs. 

3 https://docs.github.com/en/free-pro-team@latest/rest/reference 
4 https://developer.twitter.com/en/docs 
5 https://docs.github.com/en/free-pro-team@latest/rest/reference/gists 


5.2 Leif Effectiveness Evaluation 


In this experiment, we select 27 popular mobile applications to evaluate the 
effectiveness of Leif. Each of them is backed by a commercial RESTful web 
service serving millions to billions of users. We monkey-test [30] each 
application for 20 minutes, capture HTTP traffic, and run the full-stack Leif 
workflow. Table 1 lists the subjects; the services have an average of 133 
RESTful APIs with over 19 parameters per API. We collect 46 requests per API 
on average, which yields adequate request samples for inference. Leif reports 
5XX HTTP responses as bugs along with the corresponding traffic. We have 
reached out to the service owners, reported these bugs, and validated them 
through analysis of traffic (through API URLs, parameter key-value pairs, and 
response data) and analysis of the involved applications (through reverse 
engineering and static code analysis of APKs) to eliminate any false-positive or 
duplicated cases. Table 2 summarizes the 11 distinct bugs found by Leif. The 
testing process is fully automated, which mimics how developers would use Leif 
as a black-box fuzzing tool in practice, and our following analysis mimics how 
to classify bugs and locate related code lines based on Leif’s testing results. 
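
The reporting rule (every 5XX response is a bug candidate) can be sketched against the HAR structure, in which each entry of `log.entries` carries a `request` and a `response` object. The sketch assumes, hypothetically, that the responses to emitted test cases are recorded in the same HAR layout as the captured traffic; the function name `server_errors` and the example URLs are invented.

```python
def server_errors(har):
    """Return (method, url, status) for every entry whose response status is 5XX."""
    return [
        (entry["request"]["method"], entry["request"]["url"], entry["response"]["status"])
        for entry in har["log"]["entries"]
        if 500 <= entry["response"]["status"] <= 599
    ]


# Hypothetical recorded responses in HAR layout (only the fields used here).
recorded = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://api.example.com/items?page=1"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "https://api.example.com/items?page=4294967296"},
             "response": {"status": 500}},
            {"request": {"method": "POST", "url": "https://api.example.com/login"},
             "response": {"status": 502}},
        ]
    }
}
for method, url, status in server_errors(recorded):
    print(status, method, url)
```

Such a filter only yields candidates; as described above, each reported case still has to be validated to rule out false positives and duplicates.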

Security Bugs with Information Leakage. Bugs 1, 2, and 10 are security bugs 
with information leakage problems. They can be reproduced by mutating the 
parameter appVer (VersionTag), the parameter platform (Identifier), and 
the parameter c.v (Integer). These bugs not only cause service crashes but also 
expose sensitive information to end users (potential attackers). With the exposed 
information, attackers can easily design specialized attacks. For example, the 
response data of bug 10 contains the full Java exception stack trace without 
any obfuscation. From the stack trace, attackers can learn that the service uses 
an outdated Spring Framework⁶ version which suffers from numerous security 
vulnerabilities [5,6,8-11]. By exploiting CVE-2020-5421 and CVE-2020-5398 [10, 
11], attackers can initiate reflected file download attacks [31] to mislead users 
into downloading malware. By exploiting CVE-2018-1257 [5], attackers can 
expose STOMP over WebSocket and then initiate denial of service attacks [17]. 
They can also learn that the service uses the com.alibaba.fastjson library⁷ to 
deserialize user inputs. Attackers can therefore launch remote code execution 
attacks by exploiting known defects in that specific library version [7,32]. 

Table 1. Experiment Subjects of the Effectiveness Validation. 

Application   Category  Downloads^a  Version     Size (MB)   APIs  Parameters
Amazon        Shopping  205M+        18.4.0.1          213    142       2,380
Baidu         Tools     2.8B+        11.15.0.12        453    332       4,742
Bilibili      Video     220M+        5.49.0            524    219       4,338
Damai         Shopping  6.6M+        7.6.4             596    104       1,535
Dianping      Social    340M+        10.19.12          629    148       2,247
Eleme         Social    180M+        8.26.3            230     57         992
Hupu          Reading   11.1M+       7.3.26            295    229       4,446
iQiyi         Video     2.5B+        10.10.0         1,338    257       7,063
Jianshu       Reading   6.4M+        4.16.0            339    111       1,609
Jingdong      Shopping  950M+        8.3.2             514    131       1,521
Kaola         Shopping  15.3M+       4.3.5             322    252       3,848
Mafengwo      Trip      21.3M+       9.3.33            340    151       3,178
Meituan       Shopping  1.4B+        10.3.401        1,111     58       1,151
MissFresh     Shopping  16.3M+       9.6.4             348     50         719
ONE           Reading   4.8M+        4.6.2             242     53         567
Pinduoduo     Shopping  1.9B+        4.77.0            795     79         866
Qunar         Trip      330M+        8.9.28          1,246    146       1,563
Shanbay       Tools     2.91M+       4.2.6502           84      9          94
Sina News     News      110M+        7.25.1            266     53         724
Smzdm         Shopping  8.5M+        9.5.26            267    104       1,866
Sohu News     News      170M+        6.1.8             591    201       3,144
Tencent News  News      2.9B+        5.9.00          1,045    142       1,796
Tmall         Shopping  310M+        9.1.0             177     49         635
Toutiao       News      2.0B+        7.4.8           1,198    323      12,408
Tuniu         Trip      79.7M+       10.19.0           217     68         772
WUBA          Social    370M+        9.1.2              79    123       5,490
Xiaohongshu   Social    66.3M+       6.19.0            295     20         334
Total                                                   13,754  3,611      70,028

a The statistic is from Tencent AppStore (https://sj.qq.com) up to Jan. 9th, 2020. 

In such cases, we suggest that developers first avoid information leakage 
by checking the service data flow, ensuring that the output of sensitive methods 
(e.g., java.lang.Exception.toString) cannot reach end users, and then 
diagnose security problems by analyzing server logs. In addition, they should 
stay alert to public vulnerability reports and upgrade their codebases in a 
timely manner. 

6 Spring Framework, https://spring.io/projects/spring-framework 
7 Fastjson, https://github.com/alibaba/fastjson 

Table 2. Bugs Found by Leif during the Effectiveness Validation. 

Bug ID  Involved Application  Status Code  API Path              Description
1       iQiyi                 500          /book/register        A private API, served for user registration.
2       Pinduoduo             500          /cappuccino/splash    A private API, served for first-screen advertising.
3^a     Sina News             500          /oauth2/getaid.json   A deprecated public API provided by Sina Weibo, served for user authorization.
4^a     Sina News             503          /oauth2/getaid.json   A deprecated public API provided by Sina Weibo, served for user authorization.
5^b     Smzdm                 502          /integration.php      A public API provided by Baidu, served for inter-application integration.
6^c     Sohu News             502          /sendacc.jsp          A public API provided by 53KF, served for customer service.
7^c     Sohu News             502          /sendacc.jsp          A public API provided by 53KF, served for customer service.
8       Toutiao               502          /user/tab/tabs/v3     A private API, probably served for inter-application redirecting.
9       Toutiao               504          /user/tab/tabs/v3     A private API, probably served for inter-application redirecting.
10      Tuniu                 5XX          /vip/recommend        A private API, served for content recommendation.
11^b    WUBA                  502          /integration.php      A public API provided by Baidu, served for inter-application integration.

a Bug 3 and bug 4 involve the same API but with different HTTP status codes. 
b Bug 5 and bug 11 involve the same API but different applications. 
c Bug 6 and bug 7 involve the same API path but different domain names. 


Third-party API Bugs. We notice that 6 of the bugs involve APIs provided 
by third parties. Bugs 3 and 4 involve the API for user authorization provided by 
Sina Weibo, a social networking platform serving over half a billion users. We 
decompile the Sina News APK and locate the related code lines. We find that 
the application uses a deprecated version of the API. When this API fails, an 
unhandled exception is propagated and causes the application to crash. It can be 
reproduced by injecting meta characters "/.:/" to the parameter packagename 
(PackageName) and to the parameter mfp (Hash). Bugs 6 and 7 involve the API 
provided by a customer service platform. The application also suffers from the 
deprecated API and crashes when the API fails. Bugs 5 and 11 are detected in 
different applications but involve the same API provided by Baidu. These two 
bugs can be reproduced by mutating the parameter SdkVer (VersionTag). 


Using third-party APIs is very common, but they are often overlooked during 
testing. However, bugs in third-party code are as important as bugs in the 
application’s own code, because both mean application functionality failures 
for billions of end users. Our results show that Leif can find bugs that reach 
into third-party APIs. We suggest that developers capture application traffic 
and apply Leif to test untrusted third-party APIs. In addition, they should 
design proper exception handling logic for third-party code and upgrade in a 
timely manner to the latest API versions with known bugs fixed. 

Bugs with Limited Information. We obtain very limited information from 
bugs 8 and 9, because their responses contain only HTTP status codes. These 
bugs could be as critical as the security bugs since they involve a private API 
and cause the service to crash. Service developers can debug such APIs by 
following the analysis methods described above for the security bugs. 


5.3 Comparative Evaluation 


Leif vs. Trace-driven Fuzzers. We classify Leif as a trace-driven fuzzer and 
we now compare it with state-of-the-art trace-driven fuzzing tools. We select 
BurpSuite [2], a commercial security testing fuzzer for RESTful web services, and 
Fuzzapi [3], an open-source general-purpose HTTP fuzzer. They provide built-in 
candidate dictionaries but require a series of manual configurations, including 
filling the URL for each API and the data type for each parameter. Therefore 
we only apply them to Sina News, Toutiao, and Amazon Shopping (518 unique 
APIs with 15,512 parameters in total). In addition, we implement NaiveFuzzer 
as a baseline that only understands primitive data types and randomly mutates 
parameter values solely based on such coarse-grained information. We construct 
NaiveFuzzer’s candidate dictionaries by combining the dictionaries of BurpSuite 
and Fuzzapi. 

We evaluate the bug-finding capabilities of BurpSuite, Fuzzapi, Leif, and 
NaiveFuzzer by comparing the number of bugs found by each tool, as reported 
in Figure 8(a). We evaluate their fuzzing time by comparing the average 
number of test cases generated per parameter, as exhibited in Figure 8(b). 
Fewer generated test cases mean less test execution time and thus more efficient 
fuzzing. Considering that the subjects are already well-tested before release, we 
believe the bug-finding capability of Leif is better than that of BurpSuite and 
Fuzzapi, for Leif finds extra bugs. NaiveFuzzer has the same capability as 
BurpSuite and Fuzzapi because they share the same candidate space. As for 
fuzzing time, BurpSuite, Fuzzapi, and NaiveFuzzer respectively generate 
5.0x ~ 6.7x, 3.6x ~ 4.7x, and 6.3x ~ 7.1x as many test cases as Leif, indicating 
that FET techniques bring a 72% ~ 86% fuzzing time reduction. 

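
The reduction range follows from the reported ratios by simple arithmetic, as the short sketch below shows. It assumes, as the text does, that fuzzing time is proportional to the number of generated test cases; the function name `reduction` is illustrative.

```python
def reduction(ratio):
    """Share of test cases saved when a baseline generates `ratio` times as many as Leif."""
    return 1 - 1 / ratio


# Smallest and largest reported ratios across BurpSuite, Fuzzapi, and NaiveFuzzer.
lowest, highest = reduction(3.6), reduction(7.1)
print(f"{lowest:.0%} to {highest:.0%}")  # prints "72% to 86%"
```

The lower bound comes from Fuzzapi's smallest ratio (3.6x) and the upper bound from NaiveFuzzer's largest ratio (7.1x).
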
Leif vs. Specification-driven Fuzzers. We now compare Leif with existing 
specification-driven fuzzers, which test RESTful web services based on parsing 
API specifications. We select RESTler [13], a state-of-the-art research fuzzer, 
and TnT-Fuzzer [4], an open-source robustness testing tool. They both require 
OpenAPI specifications [40] as input, but most of the subject services do not 
provide OpenAPI specifications. Therefore we construct OpenAPI specifications 
for Sina News, Toutiao, and Amazon Shopping by parsing HTTP traffic and 
referencing their official API documents. 

Fig. 8. Bug-finding Capabilities and Fuzzing Time of the Evaluated Fuzzers. 

We intended to run RESTler, but unfortunately neither the executable program 
nor the source code was available. According to the paper, RESTler only supports 
primitive data types and uses a plain candidate dictionary (consisting of 0, 1, "", 
and "sampleString"). Yet none of the bugs found by Leif can be triggered by 
these values, indicating that running RESTler would fail to detect any of the 
bugs. TnT-Fuzzer generates candidate values simply based on the Python 
random() function (i.e. purely random fuzzing). We configure it to generate 
1,000 test cases per parameter (about 5x of NaiveFuzzer and 30x of Leif). Still, 
TnT-Fuzzer fails to find any bugs in the three services. We conclude that the 
two fuzzers’ effectiveness is limited by the practical difficulty of finding 
well-written OpenAPI specifications and by the quality of their candidates. 
These are also the main shortcomings of all specification-driven fuzzers. 
Besides, many modern APIs require short-lived session tokens for access control 
or throttling. Specification-driven fuzzers require manual configuration or even 
repeated reconfiguration for such parameters. In contrast, trace-driven fuzzers 
easily meet this requirement by mutating freshly captured requests. 


6 Related Work 


Model-driven Testing. Model-driven testing [15, 26, 27, 47, 48] is usually 
white-box and requires using a specific modeling method (e.g. UML or DSL) 
throughout the whole development lifecycle, which is human-intensive and 
technically limited for services spanning multiple servers and micro-services 
from different vendors. Essentially, FET techniques are also model-driven (i.e. 
driven by the lattice model) but only intervene in the test phase. Thus FET 
techniques can be practically employed to test diversified RESTful web services 
in a black-box manner. 

Trace-driven Fuzzing. Trace-driven fuzzing generates test cases by mutating 
recorded requests. Fuzzapi [3], BurpSuite [2], AppSpider [1] and Leif all fall 
into this category. Existing trace-driven fuzzers mainly focus on improving the 
ability to capture and replay HTTP traffic. However, Leif demonstrates that FET 
techniques provide fundamental parameter information to fuzzers, bringing 
enhanced bug-finding capability and a significant fuzzing time reduction. 

Specification-driven Fuzzing. Another main class of fuzz testing techniques 
is specification-driven fuzzing, such as TnT-Fuzzer [4], EvoMaster [12], and 
RESTler [13], which avoids the type collapse problem by assuming developers 
provide well-defined specifications with detailed parameter information. 
However, OpenAPI [40] is the only well-established standard up to now, yet it 
is not widely used. A survey [41] reveals that 71% of developers lack knowledge 
of the OpenAPI framework. Therefore, specification-driven fuzzing is still too 
idealistic for testing real-world RESTful web services. In comparison, instead of 
asking developers for good specifications, FET techniques generate fine-grained 
specifications (i.e. the ψ-tree^n of each parameter) on their own. 

Security Penetration Testing. Fuzz testing techniques are also commonly 
used for security penetration testing. Commercial security penetration tools, 
such as BurpSuite [2], use values of SQL injections, unescaped HTML 
characters, XML/JSON external entities, etc., to expose system vulnerabilities. 
FET techniques can also be employed in security penetration testing, as 
demonstrated in Section 5.2. Our main goal, however, is not limited to security 
testing for RESTful web services, since FET techniques improve the value 
selection strategy for general-purpose REST fuzzing. 


7 Conclusion and Future Work 


In this paper, we analyze the type collapse problem and propose FET tech- 
niques to remedy this problem. As a proof-of-concept, we design and implement 
Leif, a FET-enhanced trace-driven fuzzing tool. We demonstrate that using FET 
techniques greatly improves a fuzzer’s understanding of parameters, resulting in 
more effective fuzz testing. Our experiment results show that Leif unveils 11 new 
bugs in application-specific web services as well as general third-party open API 
platforms with 72% ~ 86% fuzzing time reduction. 

FET techniques are capable of effectively bootstrapping automated testing 
tools. We believe they are also helpful for parameter validity checking because 
these two technical problems are isomorphic in a sense. Thus we are beginning to 
study how to automatically generate or enhance parameter checking code based 
on FET techniques for RESTful web services. 

Abstract. Lifted (family-based) static analysis by abstract interpretation 
is capable of analyzing all variants of a program family simultaneously, 
in a single run without generating any of the variants explicitly. The ele- 
ments of the underlying lifted analysis domain are tuples, which maintain 
one property per variant. Still, explicit property enumeration in tuples, 
one by one for all variants, immediately yields combinatorial explosion. 
This is particularly apparent in the case of program families that, apart 
from Boolean features, contain also numerical features with large domains, 
thus giving rise to astronomical configuration spaces. 

The key for an efficient lifted analysis is a proper handling of variability- 
specific constructs of the language (e.g., feature-based runtime tests and 
#if directives). In this work, we introduce a new symbolic representation 
of the lifted abstract domain that can efficiently analyze program families 
with numerical features. This makes sharing between property elements 
corresponding to different variants explicitly possible. The elements of 
the new lifted domain are constraint-based decision trees, where decision 
nodes are labeled with linear constraints defined over numerical features 
and the leaf nodes belong to an existing single-program analysis domain. 
To illustrate the potential of this representation, we have implemented 
an experimental lifted static analyzer, called SPLNUM²ANALYZER, for 
inferring invariants of C programs. An empirical evaluation on BusyBox 
and on benchmarks from SV-COMP yields promising preliminary re- 
sults indicating that our decision trees-based approach is effective and 
outperforms the baseline tuple-based approach. 


1 Introduction 


Many software systems today are configurable [6]: they use features (or config- 
urable options) to control the presence and absence of functionality. Different 
family members, called variants, are derived by switching features on and off, while 
the reuse of common code is maximized, leading to productivity gains, shorter 
time to market, greater market coverage, etc. Program families (e.g., software 
product lines) are commonly seen in the development of commercial embedded 
software, such as cars, phones, avionics, medicine, robotics, etc. Configurable 
options (features) are used to either support different application scenarios for
embedded components, to provide portability across different hardware platforms 
and configurations, or to produce variations of products for different market 
segments or customers. We consider here program families implemented using 
#if directives from the C preprocessor CPP [20]. They use #if-s to specify in 
which conditions parts of code should be included or excluded from a variant. 
Classical program families use only Boolean features that have two values: on and 
off. However, Boolean features are insufficient for real-world program families, 
as there exist features that have a range of numbers as possible values. These 
features are called numerical features [25]. For instance, the Linux kernel, BusyBox,
the Apache web server, and the Java Garbage Collector are real-world program
families with numerical features. Analyzing such program families is very
challenging, because a huge number of variants can be derived from only a few
features.


In this paper, we are concerned with the verification of program families with 
Boolean and numerical features using abstract interpretation-based static analysis. 
Abstract interpretation [7,24] is a general theory for approximating the semantics 
of programs. It provides sound (all confirmative answers are correct) and efficient 
(with a good trade-off between precision and cost) static analyses of run-time 
properties of real programs. It has been used as the foundation for various 
successful industrial-scale static analyzers, such as ASTRÉE [8]. Still, the static
analysis of program families is harder than the static analysis of single programs, 
because the number of possible variants can be very large (often huge) in practice. 
The simplest brute-force approach that uses a preprocessor to generate all variants 
of a family, and then applies an existing off-the-shelf single-program analyzer to 
each individual variant, one-by-one, is very inefficient [3,27]. Therefore, we use 
so-called lifted (family-based) static analyses [3,22,27], which analyze all variants 
of the family simultaneously without generating any of the variants explicitly. 
They take as input the common code base, which encodes all variants of a 
program family, and produce precise analysis results corresponding to all variants. 
They use a lifted analysis domain, which represents an n-fold product of an 
existing single-program analysis domain used for expressing program properties 
(where n is the number of valid configurations). That is, the lifted analysis 
domain maintains one property element per valid variant in tuples. The problem 
is that this explicit property enumeration in tuples becomes computationally 
intractable with larger program families because the number of variants (i.e., 
configurations) grows exponentially with the number of features. This problem 
has been successfully addressed for program families that contain only Boolean 
features [1,2,11], by using sharing through binary decision diagrams (BDDs). 
However, the fundamental limitation of existing lifted analysis techniques is that 
they are not able to handle numerical features. 


To overcome this limitation, we present a new, refined lifted abstract domain 
for effectively analyzing program families with numerical features by means of 
abstract interpretation. The elements of the lifted abstract domain are constraint- 
based decision trees, where the decision nodes are labelled with linear constraints 
over numerical features, whereas the leaf nodes belong to a single-program analysis
domain. The decision trees recursively partition the space of configurations (i.e., 
the space of possible combinations of feature values), whereas the program 
properties at the leaves provide analysis information corresponding to each 
partition, i.e. to the variants (configurations) that satisfy the constraints along 
the path to the given leaf node. The partitioning is dynamic, which means that 
partitions are split by feature-based tests (at #if directives), and joined when 
merging the corresponding control flows again. In terms of decision trees, this 
means that new decision nodes are added by feature-based tests and removed 
when merging control flows. In fact, the partitioning of the set of configurations 
is semantics-based, which means that linear constraints over numerical features 
that occur in decision nodes are automatically inferred by the analysis and do 
not necessarily occur syntactically in the code base. 


Our lifted abstract domain is parametric in the choice of numerical property 
domain [7,24] that underlies the linear constraints over numerical features labelling 
decision nodes, and the choice of the single-program analysis domain for leaf 
nodes. In fact, in our implementation, we also use numerical property domains 
for leaf nodes, which encode linear constraints over program variables. We 
rely on the well-known numerical domains, such as intervals [7], octagons [23], 
polyhedra [10], from the APRON library [19] to obtain a concrete decision 
tree-based implementation of the lifted abstract domain. This way, we have 
implemented a forward reachability analysis of C program families with numerical 
(and Boolean) features for the automatic inference of invariants. Our tool, called 
SPLNum²Analyzer⁴, computes a set of possible invariants, which represent
linear constraints over program variables. We can use the implemented lifted
static analyzer to check invariance properties of C program families, such as
assertions, buffer overflows, null pointer references, division by zero, etc. [8].

In summary, we make several contributions: (1) We propose a new, param- 
eterized lifted analysis domain based on decision trees for analyzing program 
families with numerical features; (2) We implement a prototype lifted static 
analyzer, SPLNum²Analyzer, that performs a forward analysis of #if-enriched
C programs, where numerical property domains from the APRON library are 
used as parameters in the lifted analysis domain; (3) We evaluate our approach for 
automatic inference of invariants by comparing performances of lifted analyzers 
based on tuples and decision trees. 


2 Motivating Example 


To illustrate the potential of a decision tree-based lifted domain, we consider a 
motivating example using the code base of the following program family SIMPLE: 


4 Num² in the name of the tool refers to its ability both to handle Numerical features
and to perform Numerical client analysis of SPLs (program families).
① int x := 10, y := 0;
② while (x != 0) {
③   x := x-1;
④   #if (SIZE ≤ 3) y := y+1; #else y := y-1; #endif
⑤   #if (!B) y := 0; #else skip; #endif ⑥ }
⑦ assert (y > 1);


The set $\mathbb{F}$ of features is $\{B, \mathrm{SIZE}\}$, where $B$ is a Boolean feature and SIZE is a
numerical feature whose domain is $[1,4] = \{1,2,3,4\}$. Thus, the set of valid
configurations is $\mathbb{K} = \{B \wedge (\mathrm{SIZE}=1), B \wedge (\mathrm{SIZE}=2), B \wedge (\mathrm{SIZE}=3), B \wedge (\mathrm{SIZE}=4),
\neg B \wedge (\mathrm{SIZE}=1), \neg B \wedge (\mathrm{SIZE}=2), \neg B \wedge (\mathrm{SIZE}=3), \neg B \wedge (\mathrm{SIZE}=4)\}$. The
code of SIMPLE contains two #if directives, which change the value assigned
to y, depending on how features from $\mathbb{F}$ are set at compile-time. For each
configuration from $\mathbb{K}$, a different variant (single program) can be generated
by appropriately resolving #if-s. For example, the variant corresponding to
configuration $B \wedge (\mathrm{SIZE}=1)$ will have B and SIZE set to true and 1, so that the
assignment y := y+1 and skip in program locations ④ and ⑤, respectively, will
be included in this variant. The variant for configuration $\neg B \wedge (\mathrm{SIZE}=4)$ will have
features B and SIZE set to false and 4, so the assignments y := y-1 and y := 0 in
program locations ④ and ⑤, respectively, will be included in this variant. There
are $|\mathbb{K}| = 8$ variants that can be derived from the family SIMPLE.

Assume that we want to perform lifted polyhedra analysis of SIMPLE using
the Polyhedra numerical domain [10]. The standard lifted analysis domain used
in the literature [3,22] is defined as the Cartesian product of $|\mathbb{K}|$ copies of the basic
analysis domain (e.g. polyhedra). Hence, elements of the lifted domain are tuples
containing one component for each valid configuration from $\mathbb{K}$, where each
component represents a polyhedral linear constraint over program variables (x
and y in this case). The lifted analysis result in location ⑦ of SIMPLE is an
8-sized tuple shown in Fig. 1. Note that the first component of the tuple in
Fig. 1 corresponds to configuration $B \wedge (\mathrm{SIZE}=1)$, the second to $B \wedge (\mathrm{SIZE}=2)$,
the third to $B \wedge (\mathrm{SIZE}=3)$, and so on. We can see in Fig. 1 that the polyhedra
analysis discovers very precise results for the variable y: $(y=10)$ for configurations
$B \wedge (\mathrm{SIZE}=1)$, $B \wedge (\mathrm{SIZE}=2)$, and $B \wedge (\mathrm{SIZE}=3)$; $(y=-10)$ for configuration
$B \wedge (\mathrm{SIZE}=4)$; and $(y=0)$ for all other configurations. This is due to the fact that
the polyhedra domain is fully relational and is able to track all relations between
program variables x and y. Using this result in location ⑦, we can successfully
conclude that the assertion is valid for configurations $B \wedge (\mathrm{SIZE}=1)$, $B \wedge (\mathrm{SIZE}=2)$,
and $B \wedge (\mathrm{SIZE}=3)$, whereas the assertion fails for all other configurations.
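The per-configuration values reported above can be cross-checked with a brute-force sketch that simply executes each of the eight variants. This is not the lifted analysis itself and not abstract interpretation: it is a concrete run of the deterministic SIMPLE program, with the two #if directives resolved by ordinary Python conditionals. The function names `run_variant` and `brute_force` are hypothetical.

```python
from itertools import product


def run_variant(b: bool, size: int) -> tuple[int, int]:
    """Concretely execute the variant of SIMPLE selected by features B and SIZE."""
    x, y = 10, 0
    while x != 0:
        x = x - 1
        # #if (SIZE <= 3) y := y+1; #else y := y-1; #endif
        y = y + 1 if size <= 3 else y - 1
        # #if (!B) y := 0; #else skip; #endif
        if not b:
            y = 0
    return x, y


def brute_force() -> dict[tuple[bool, int], tuple[int, int, bool]]:
    """Map each configuration (B, SIZE) to (x, y, whether assert(y > 1) holds)."""
    results = {}
    for b, size in product([True, False], [1, 2, 3, 4]):
        x, y = run_variant(b, size)
        results[(b, size)] = (x, y, y > 1)
    return results
```

Each variant is generated and run separately here, which is exactly the one-by-one enumeration that lifted analysis is meant to avoid.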

$$\Big(\overbrace{[y=10, x=0]}^{B \wedge (\mathrm{SIZE}=1)},\ \overbrace{[y=10, x=0]}^{B \wedge (\mathrm{SIZE}=2)},\ \overbrace{[y=10, x=0]}^{B \wedge (\mathrm{SIZE}=3)},\ \overbrace{[y=-10, x=0]}^{B \wedge (\mathrm{SIZE}=4)},\ \overbrace{[y=0, x=0]}^{\neg B \wedge (\mathrm{SIZE}=1)},\ \overbrace{[y=0, x=0]}^{\neg B \wedge (\mathrm{SIZE}=2)},\ \overbrace{[y=0, x=0]}^{\neg B \wedge (\mathrm{SIZE}=3)},\ \overbrace{[y=0, x=0]}^{\neg B \wedge (\mathrm{SIZE}=4)}\Big)$$

Fig. 1: Tuple-based invariant at location ⑦ of SIMPLE.

[Decision tree: the root node B has a true branch leading to the node (SIZE ≤ 3),
whose true leaf is [y=10 ∧ x=0] and whose false leaf is [y=−10 ∧ x=0]; the false
branch of B leads to the leaf [y=0 ∧ x=0].]

Fig. 2: Decision tree-based invariant at location ⑦ of SIMPLE (solid edges
= true, dashed edges = false).

If we perform lifted polyhedra analysis based on the decision tree domain
proposed in this work, then the corresponding decision tree inferred in the final
program location ⑦ of SIMPLE is depicted in Fig. 2. Notice that the inner
nodes of the decision tree in Fig. 2 are labeled with Interval linear constraints
over features (SIZE and B), while the leaves are labeled with the Polyhedra
linear constraints over program variables x and y. Hence, we use two different
numerical abstract domains in our decision trees: Interval domain [7] for expressing
properties in decision nodes, and Polyhedra domain [10] for expressing properties
in leaf nodes. The edges of decision trees are labeled with the truth value of
the decision on the parent node; we use solid edges for true (i.e. the constraint
in the parent node is satisfied) and dashed edges for false (i.e. the negation of
the constraint in the parent node is satisfied). As decision nodes partition the
space of valid configurations $\mathbb{K}$, we implicitly assume the correctness of linear
constraints that take into account domains of numerical features. For example,
the node with constraint $(\mathrm{SIZE} \leq 3)$ is satisfied when $(\mathrm{SIZE} \leq 3) \wedge (1 \leq \mathrm{SIZE} \leq 4)$,
whereas its negation is satisfied when $(\mathrm{SIZE} > 3) \wedge (1 \leq \mathrm{SIZE} \leq 4)$. The constraints
$(1 \leq \mathrm{SIZE} \leq 4)$ represent the domain $[1,4]$ of SIZE. We can see that decision trees
offer more possibilities for sharing and interaction between analysis properties
corresponding to different configurations; they provide a symbolic and compact
representation of lifted analysis elements. For example, Fig. 2 presents polyhedra
properties of two program variables x and y, which are partitioned with respect
to features B and SIZE. When $(B \wedge (\mathrm{SIZE} \leq 3))$ is true the shared property is
$(y=10, x=0)$, whereas when $(B \wedge \neg(\mathrm{SIZE} \leq 3))$ is true the shared property is
$(y=-10, x=0)$. When $\neg B$ is true, the property is independent from the value
of SIZE, hence a node with a constraint over SIZE is not needed. Therefore, all
such cases are identical and so they share the same leaf node $(y=0, x=0)$. In
effect, the decision tree-based representation uses only three leaves, whereas the
tuple-based representation uses eight properties. This ability for sharing is the
key motivation behind the decision tree-based representation.
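The sharing argument can be made concrete with a minimal sketch of the tree from Fig. 2. The encoding below (nested tuples, Python predicates in decision nodes, dictionaries as leaf properties) and the helper names `leaf`, `node`, `lookup`, `count_leaves`, and `to_tuple` are assumptions made for illustration; the actual analyzer stores abstract-domain elements in both kinds of nodes.

```python
from itertools import product


def leaf(prop):
    """A leaf holding an analysis property (here: exact values of x and y)."""
    return ("leaf", prop)


def node(name, pred, tl, tr):
    """A decision node; tl is taken when pred(k) is true, tr otherwise."""
    return ("node", name, pred, tl, tr)


# The tree of Fig. 2: root B, then SIZE <= 3 on the true branch.
FIG2 = node("B", lambda k: k["B"],
            node("SIZE <= 3", lambda k: k["SIZE"] <= 3,
                 leaf({"y": 10, "x": 0}),
                 leaf({"y": -10, "x": 0})),
            leaf({"y": 0, "x": 0}))


def lookup(t, k):
    """Follow the decisions for configuration k down to its leaf property."""
    while t[0] == "node":
        _, _, pred, tl, tr = t
        t = tl if pred(k) else tr
    return t[1]


def count_leaves(t):
    """Number of leaves, i.e. of stored properties."""
    if t[0] == "leaf":
        return 1
    return count_leaves(t[3]) + count_leaves(t[4])


def to_tuple(t):
    """Expand the tree into the explicit 8-component tuple of Fig. 1."""
    configs = [{"B": b, "SIZE": s}
               for b, s in product([True, False], [1, 2, 3, 4])]
    return [lookup(t, k) for k in configs]
```

Expanding the tree recovers all eight components of the tuple, while the tree itself stores only three properties.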


3 A Language for Program Families 


Let $\mathbb{F} = \{A_1, \ldots, A_k\}$ be a finite and totally ordered set of numerical features
available in a program family. For each feature $A \in \mathbb{F}$, $\mathrm{dom}(A) \subseteq \mathbb{Z}$ denotes the
set of possible values that can be assigned to $A$. Note that any Boolean feature
can be represented as a numerical feature $B \in \mathbb{F}$ with $\mathrm{dom}(B) = \{0,1\}$, such
that 0 means that feature $B$ is disabled while 1 means that $B$ is enabled. A
valid combination of feature values represents a configuration $k$, which specifies
one variant of a program family. It is given as a valuation function $k : \mathbb{F} \to \mathbb{Z}$,
which is a mapping that assigns a value from $\mathrm{dom}(A)$ to each feature $A$, i.e.
$k(A) \in \mathrm{dom}(A)$ for any $A \in \mathbb{F}$. We assume that only a subset $\mathbb{K}$ of all possible
configurations is valid. An alternative representation of configurations is based
upon propositional formulae. Each configuration $k \in \mathbb{K}$ can be represented by
a formula: $(A_1 = k(A_1)) \wedge \ldots \wedge (A_k = k(A_k))$. We often abbreviate $(B = 1)$
with $B$ and $(B = 0)$ with $\neg B$, for a Boolean feature $B \in \mathbb{F}$. The set of valid
configurations $\mathbb{K}$ can also be represented as a formula: $\bigvee_{k \in \mathbb{K}} k$.

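We define feature expressions, denoted $\mathrm{FeatExp}(\mathbb{F})$, as the set of propositional
logic formulas over constraints of $\mathbb{F}$ generated by the grammar:

$$\theta ::= \mathrm{true} \mid e_{\mathbb{F}} \bowtie e_{\mathbb{F}} \mid \neg\theta \mid \theta_1 \wedge \theta_2 \mid \theta_1 \vee \theta_2, \qquad e_{\mathbb{F}} ::= n \mid A \mid e_{\mathbb{F}} \oplus e_{\mathbb{F}}$$

where $A \in \mathbb{F}$, $n \in \mathbb{Z}$, $\oplus \in \{+, -, *\}$, and $\bowtie \in \{=, <\}$. We will use $\theta \in \mathrm{FeatExp}(\mathbb{F})$
to write presence conditions. When a configuration $k \in \mathbb{K}$ satisfies a feature
expression $\theta \in \mathrm{FeatExp}(\mathbb{F})$, we write $k \models \theta$, where $\models$ is the standard satisfaction
relation of logic. We write $[\![\theta]\!]$ to denote the set of configurations from $\mathbb{K}$ that
satisfy $\theta$, that is, $k \in [\![\theta]\!]$ iff $k \models \theta$.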


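Example 1. For the SIMPLE program family from Section 2, the set of features
is $\mathbb{F} = \{B, \mathrm{SIZE}\}$ where $\mathrm{dom}(\mathrm{SIZE}) = [1,4]$, and the set of configurations is
$\mathbb{K} = \{B \wedge (\mathrm{SIZE}=1), B \wedge (\mathrm{SIZE}=2), B \wedge (\mathrm{SIZE}=3), B \wedge (\mathrm{SIZE}=4), \neg B \wedge (\mathrm{SIZE}=1),
\neg B \wedge (\mathrm{SIZE}=2), \neg B \wedge (\mathrm{SIZE}=3), \neg B \wedge (\mathrm{SIZE}=4)\}$. For the feature expression
$(\mathrm{SIZE} \leq 3)$, we have $[\![(\mathrm{SIZE} \leq 3)]\!] = \{B \wedge (\mathrm{SIZE}=1), B \wedge (\mathrm{SIZE}=2), B \wedge (\mathrm{SIZE}=3),
\neg B \wedge (\mathrm{SIZE}=1), \neg B \wedge (\mathrm{SIZE}=2), \neg B \wedge (\mathrm{SIZE}=3)\}$. Hence, $B \wedge (\mathrm{SIZE}=2) \models (\mathrm{SIZE} \leq 3)$
and $B \wedge (\mathrm{SIZE}=4) \not\models (\mathrm{SIZE} \leq 3)$, where $B \wedge (\mathrm{SIZE}=2) \in \mathbb{K}$,
$B \wedge (\mathrm{SIZE}=4) \in \mathbb{K}$, and $(\mathrm{SIZE} \leq 3) \in \mathrm{FeatExp}(\mathbb{F})$.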
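The satisfaction relation and the set of configurations satisfying a feature expression can be sketched as a small evaluator over the grammar above. The nested-tuple encoding of expressions and the names `eval_exp`, `sat`, and `denote` are hypothetical. Since the grammar only offers the comparisons = and <, the expression (SIZE ≤ 3) is written here as (SIZE < 4), which is equivalent over the integers.

```python
from itertools import product


def eval_exp(e, k):
    """Evaluate a feature arithmetic expression e under configuration k."""
    if isinstance(e, int):          # constant n
        return e
    if isinstance(e, str):          # feature name A
        return k[e]
    op, left, right = e             # (op, e1, e2) with op in {+, -, *}
    a, b = eval_exp(left, k), eval_exp(right, k)
    return {"+": a + b, "-": a - b, "*": a * b}[op]


def sat(k, theta):
    """Return True iff configuration k satisfies feature expression theta."""
    if theta is True:
        return True
    tag = theta[0]
    if tag == "=":
        return eval_exp(theta[1], k) == eval_exp(theta[2], k)
    if tag == "<":
        return eval_exp(theta[1], k) < eval_exp(theta[2], k)
    if tag == "not":
        return not sat(k, theta[1])
    if tag == "and":
        return sat(k, theta[1]) and sat(k, theta[2])
    if tag == "or":
        return sat(k, theta[1]) or sat(k, theta[2])
    raise ValueError(f"unknown feature expression: {theta!r}")


# Valid configurations of SIMPLE; B is encoded numerically with dom(B) = {0, 1}.
K = [{"B": b, "SIZE": s} for b, s in product([1, 0], [1, 2, 3, 4])]


def denote(theta):
    """The set of valid configurations that satisfy theta."""
    return [k for k in K if sat(k, theta)]
```

With this encoding, `denote(("<", "SIZE", 4))` yields the six configurations listed in Example 1.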


We consider a simple sequential non-deterministic programming language,
which will be used to exemplify our work. The program variables Var are statically
allocated and the only data type is the set $\mathbb{Z}$ of mathematical integers. To encode
multiple variants, a new compile-time conditional statement is included. The new
statement “#if ($\theta$) s #endif” contains a feature expression $\theta \in \mathrm{FeatExp}(\mathbb{F})$ as a
presence condition, such that only if $\theta$ is satisfied by a configuration $k \in \mathbb{K}$ the
statement s will be included in the variant corresponding to k. The syntax is:

$$s ::= \texttt{skip} \mid x := e \mid s;\, s \mid \texttt{if}\ (e)\ \texttt{then}\ s\ \texttt{else}\ s \mid \texttt{while}\ (e)\ \texttt{do}\ s \mid \texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}, \qquad e ::= n \mid [n, n'] \mid x \mid e \oplus e$$

where $n$ ranges over integers, $[n, n']$ over integer intervals, $x$ over program variables
Var, and $\oplus$ over binary arithmetic operators. Integer intervals $[n, n']$ denote a
random choice of an integer in the interval. The set of all statements s is denoted
by Stm; the set of all expressions e is denoted by Exp.

A program family is evaluated in two stages. First, the C preprocessor CPP
takes a program family s and a configuration $k \in \mathbb{K}$ as inputs, and produces a
variant (without #if-s) corresponding to k as the output. Second, the obtained
variant is evaluated using the standard single-program semantics. The first
stage is specified by the projection function $P_k$, which is an identity for all
basic statements and recursively pre-processes all sub-statements of compound
statements. Hence, $P_k(\texttt{skip}) = \texttt{skip}$ and $P_k(s; s') = P_k(s); P_k(s')$. The interesting
case is “#if ($\theta$) s #endif”, where statement s is included in the variant if $k \models \theta$,
otherwise, s is removed⁵:

$$P_k(\texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}) = \begin{cases} P_k(s) & \text{if } k \models \theta \\ \texttt{skip} & \text{if } k \not\models \theta \end{cases}$$

For example, variants $P_{B \wedge (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$, $P_{B \wedge (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$, $P_{\neg B \wedge (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$, as well as
$P_{\neg B \wedge (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$ shown in Fig. 3a, Fig. 3b, Fig. 3c, and Fig. 3d, respectively,
are derived from the SIMPLE family defined in Section 2.

int x := 10, y := 0;    int x := 10, y := 0;    int x := 10, y := 0;    int x := 10, y := 0;
while (x != 0) {        while (x != 0) {        while (x != 0) {        while (x != 0) {
  x := x-1;               x := x-1;               x := x-1;               x := x-1;
  y := y+1;               y := y-1;               y := y+1;               y := y-1;
  skip; }                 skip; }                 y := 0; }               y := 0; }

(a) $P_{B \wedge (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$   (b) $P_{B \wedge (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$   (c) $P_{\neg B \wedge (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$   (d) $P_{\neg B \wedge (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$

Fig. 3: Different variants of the program family SIMPLE from Section 2.
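The projection function can be sketched as a recursive traversal of a statement tree. The tuple-based AST, the use of Python predicates as presence conditions, and the name `project` are assumptions made for this illustration; expressions are kept as opaque strings because projection never inspects them.

```python
def project(s, k):
    """Return the variant of statement s for configuration k (no #if left)."""
    tag = s[0]
    if tag in ("skip", "assign"):
        # Basic statements are returned unchanged.
        return s
    if tag == "seq":
        return ("seq", project(s[1], k), project(s[2], k))
    if tag == "if":
        return ("if", s[1], project(s[2], k), project(s[3], k))
    if tag == "while":
        return ("while", s[1], project(s[2], k))
    if tag == "ifdef":
        # ("ifdef", theta, body): keep the body only when k satisfies theta.
        _, theta, body = s
        return project(body, k) if theta(k) else ("skip",)
    raise ValueError(f"unknown statement: {tag!r}")


# A fragment of SIMPLE: while (x != 0) { x := x-1; #if (SIZE <= 3) y := y+1; #endif }
FRAGMENT = ("while", "x != 0",
            ("seq",
             ("assign", "x", "x-1"),
             ("ifdef", lambda k: k["SIZE"] <= 3, ("assign", "y", "y+1"))))
```

Calling `project(FRAGMENT, {"B": 1, "SIZE": 4})` replaces the guarded assignment by `skip`, whereas for SIZE = 1 the assignment is kept.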


4 Lifted Analysis based on Tuples 


Lifted analyses are designed by lifting existing single-program analyses to work
on program families, rather than on individual programs. They directly analyze
program families. Lifted analyses as defined by Midtgaard et al. [22] rely on
a lifted domain that is the $|\mathbb{K}|$-fold product of an existing single-program analysis
domain $\mathbb{A}$ defined over program variables Var. We assume that the domain $\mathbb{A}$
is equipped with sound operators for concretization $\gamma_{\mathbb{A}}$, ordering $\sqsubseteq_{\mathbb{A}}$, join $\sqcup_{\mathbb{A}}$,
meet $\sqcap_{\mathbb{A}}$, bottom $\bot_{\mathbb{A}}$, top $\top_{\mathbb{A}}$, widening $\nabla_{\mathbb{A}}$, and narrowing $\triangle_{\mathbb{A}}$, as well as sound
transfer functions for tests $\mathrm{FILTER}_{\mathbb{A}}$ and forward assignments $\mathrm{ASSIGN}_{\mathbb{A}}$. More
specifically, $\mathrm{FILTER}_{\mathbb{A}}(a : \mathbb{A}, e : \mathrm{Exp})$ returns an abstract element from $\mathbb{A}$ obtained
by restricting $a$ to satisfy the test $e$, whereas $\mathrm{ASSIGN}_{\mathbb{A}}(a : \mathbb{A}, x{:=}e : \mathrm{Stm})$ returns
an updated version of $a$ by abstractly evaluating $x{:=}e$ in it.


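Lifted Domain. The lifted analysis domain is defined as $\langle \mathbb{A}^{\mathbb{K}}, \dot\sqsubseteq, \dot\sqcup, \dot\sqcap, \dot\bot, \dot\top \rangle$, where
$\mathbb{A}^{\mathbb{K}}$ is shorthand for the $|\mathbb{K}|$-fold product $\prod_{k \in \mathbb{K}} \mathbb{A}$, that is, there is one separate
copy of $\mathbb{A}$ for each configuration of $\mathbb{K}$. For example, consider the tuple in Fig. 1.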


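Lifted Abstract Operations. Given a tuple (lifted domain element) $\bar a \in \mathbb{A}^{\mathbb{K}}$, the
projection $\pi_k$ selects the $k$-th component of $\bar a$. All abstract lifted operations are
defined by lifting the abstract operations of the domain $\mathbb{A}$ configuration-wise.

$$\begin{aligned}
\dot\gamma(\bar a) &= \prod_{k \in \mathbb{K}} \gamma_{\mathbb{A}}(\pi_k(\bar a)), &
\bar a_1 \mathrel{\dot\sqsubseteq} \bar a_2 &\equiv \pi_k(\bar a_1) \sqsubseteq_{\mathbb{A}} \pi_k(\bar a_2) \text{ for } \forall k \in \mathbb{K} \\
\bar a_1 \mathbin{\dot\sqcup} \bar a_2 &= \prod_{k \in \mathbb{K}} \big(\pi_k(\bar a_1) \sqcup_{\mathbb{A}} \pi_k(\bar a_2)\big), &
\bar a_1 \mathbin{\dot\sqcap} \bar a_2 &= \prod_{k \in \mathbb{K}} \big(\pi_k(\bar a_1) \sqcap_{\mathbb{A}} \pi_k(\bar a_2)\big) \\
\dot\top &= \prod_{k \in \mathbb{K}} \top_{\mathbb{A}} = (\top_{\mathbb{A}}, \ldots, \top_{\mathbb{A}}), &
\dot\bot &= \prod_{k \in \mathbb{K}} \bot_{\mathbb{A}} = (\bot_{\mathbb{A}}, \ldots, \bot_{\mathbb{A}}) \\
\bar a_1 \mathbin{\dot\nabla} \bar a_2 &= \prod_{k \in \mathbb{K}} \big(\pi_k(\bar a_1) \nabla_{\mathbb{A}} \pi_k(\bar a_2)\big), &
\bar a_1 \mathbin{\dot\triangle} \bar a_2 &= \prod_{k \in \mathbb{K}} \big(\pi_k(\bar a_1) \triangle_{\mathbb{A}} \pi_k(\bar a_2)\big)
\end{aligned}$$

5 Since $k \in \mathbb{K}$ is a valuation function, either $k \models \theta$ holds or $k \not\models \theta$ holds for any $\theta$.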
Lifted Transfer Functions. We now define lifted transfer functions for tests,
forward assignments (ASSIGN), and #if-s (IFDEF). There are two types of
tests: expression-based tests, denoted FILTER, that occur in while-s and if-s,
and feature-based tests, denoted FEAT-FILTER, that occur in #if-s. Each
lifted transfer function takes as input a tuple from $\mathbb{A}^{\mathbb{K}}$ representing the invariant
before evaluating the statement (resp., expression) to handle, and returns a tuple
representing the invariant after evaluating the given statement (resp., expression).

$$\begin{aligned}
\mathrm{FILTER}(\bar a : \mathbb{A}^{\mathbb{K}}, e : \mathrm{Exp}) &= \prod_{k \in \mathbb{K}} \big(\mathrm{FILTER}_{\mathbb{A}}(\pi_k(\bar a), e)\big) \\
\mathrm{FEAT\text{-}FILTER}(\bar a : \mathbb{A}^{\mathbb{K}}, \theta : \mathrm{FeatExp}(\mathbb{F})) &= \prod_{k \in \mathbb{K}} \begin{cases} \pi_k(\bar a) & \text{if } k \models \theta \\ \bot_{\mathbb{A}} & \text{if } k \not\models \theta \end{cases} \\
\mathrm{ASSIGN}(\bar a : \mathbb{A}^{\mathbb{K}}, x{:=}e : \mathrm{Stm}) &= \prod_{k \in \mathbb{K}} \big(\mathrm{ASSIGN}_{\mathbb{A}}(\pi_k(\bar a), x{:=}e)\big) \\
\mathrm{IFDEF}(\bar a : \mathbb{A}^{\mathbb{K}}, \texttt{\#if}\ (\theta)\ s : \mathrm{Stm}) &= [\![s]\!]\big(\mathrm{FEAT\text{-}FILTER}(\bar a, \theta)\big) \mathbin{\dot\sqcup} \mathrm{FEAT\text{-}FILTER}(\bar a, \neg\theta)
\end{aligned}$$

where $[\![s]\!](\bar a)$ is the lifted transfer function for statement s. FILTER and ASSIGN
are defined by applying $\mathrm{FILTER}_{\mathbb{A}}$ and $\mathrm{ASSIGN}_{\mathbb{A}}$ independently on each
component of the input tuple $\bar a$. FEAT-FILTER keeps those components k of the
input tuple $\bar a$ that satisfy $\theta$, and replaces the other components with $\bot_{\mathbb{A}}$.
IFDEF captures the effect of analyzing the statement s in the components k of
$\bar a$ that satisfy $\theta$, and is an identity for the other components.
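A minimal sketch of the tuple-based lifted domain follows, assuming a deliberately tiny leaf domain: a single interval for the variable y, with `None` standing for the bottom element. The names `join_a`, `join`, `feat_filter`, `assign_add`, and `ifdef` are hypothetical, and feature expressions are passed as Python predicates over configurations (B, SIZE).

```python
from itertools import product

BOT = None  # bottom element of the (assumed) interval domain for y


def join_a(a, b):
    """Join of two intervals (lo, hi); bottom is the neutral element."""
    if a is BOT:
        return b
    if b is BOT:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def join(t1, t2):
    """Lifted join: applied independently to each configuration's component."""
    return {k: join_a(t1[k], t2[k]) for k in t1}


def feat_filter(t, theta):
    """Keep the components whose configuration satisfies theta, else bottom."""
    return {k: (a if theta(k) else BOT) for k, a in t.items()}


def assign_add(t, n):
    """Lifted transfer function for y := y + n, applied component-wise."""
    return {k: (BOT if a is BOT else (a[0] + n, a[1] + n)) for k, a in t.items()}


def ifdef(t, theta, stmt):
    """Lifted #if: analyze stmt where theta holds, keep the rest unchanged."""
    return join(stmt(feat_filter(t, theta)),
                feat_filter(t, lambda k: not theta(k)))


# One component per valid configuration (B, SIZE) of SIMPLE.
K = list(product([True, False], [1, 2, 3, 4]))
```

Starting from `{k: (0, 0) for k in K}`, the call `ifdef(a0, lambda k: k[1] <= 3, lambda t: assign_add(t, 1))` updates y only in the six components with SIZE ≤ 3; all eight components are nevertheless stored and visited, which is the cost the decision tree domain aims to reduce.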


Lifted Analysis. Lifted abstract operators and transfer functions of the lifted
analysis domain $\mathbb{A}^{\mathbb{K}}$ are combined together to analyze program families. Initially,
we build a tuple $\bar a_{in}$, where all components are set to $\top_{\mathbb{A}}$ for the first program
location, and tuples where all components are set to $\bot_{\mathbb{A}}$ for all other locations.
The analysis properties are propagated forward from the first program location
towards the final location taking assignments, #if-s, and tests into account with
join and widening around while-s. The soundness of the lifted analysis based on
$\mathbb{A}^{\mathbb{K}}$ follows immediately from the soundness of all abstract operators and transfer
functions of $\mathbb{A}$ (proved in [22]).


Numerical Lifted Analysis. The single-program analysis domain $\mathbb{A}$ can be instantiated
by some of the well-known numerical property domains [24], such as Intervals
$\langle I, \sqsubseteq_I \rangle$ [7], Octagons $\langle O, \sqsubseteq_O \rangle$ [26], and Polyhedra $\langle P, \sqsubseteq_P \rangle$ [10]. The elements of
$I$ are intervals of the form: $\pm x \geq \beta$, where $x \in \mathrm{Var}$, $\beta \in \mathbb{Z}$; the elements of $O$ are
conjunctions of octagonal constraints of the form $\pm x_1 \pm x_2 \geq \beta$, where $x_1, x_2 \in \mathrm{Var}$,
$\beta \in \mathbb{Z}$; while the elements of $P$ are conjunctions of polyhedral constraints of
the form $\alpha_1 x_1 + \ldots + \alpha_k x_k + \beta \geq 0$, where $x_1, \ldots, x_k \in \mathrm{Var}$, $\alpha_1, \ldots, \alpha_k, \beta \in \mathbb{Z}$.


5 Lifted Analysis based on Decision Trees 


We now introduce a new decision tree lifted domain. Its elements are disjunctions
of leaf nodes that belong to an existing single-program domain $\mathbb{A}$ defined over
program variables Var. The leaf nodes are separated by linear constraints over
numerical features, organized in the decision nodes. Hence, we encapsulate the
set of configurations $\mathbb{K}$ into the decision nodes of a decision tree where each
top-down path represents one or several configurations that satisfy the constraints
encountered along the given path. We store in each leaf node the property
generated from the variants representing the corresponding configurations.


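Abstract domain for decision nodes. We define the family of abstract domains for
linear constraints $\mathbb{C}_{\mathbb{D}}$, which are parameterized by any of the numerical property
domains $\mathbb{D}$ (intervals $I$, octagons $O$, polyhedra $P$). We use $\mathbb{C}_I = \{\pm A_i \geq \beta \mid
A_i \in \mathbb{F}, \beta \in \mathbb{Z}\}$ to denote the set of interval constraints, $\mathbb{C}_O = \{\pm A_i \pm A_j \geq \beta \mid
A_i, A_j \in \mathbb{F}, \beta \in \mathbb{Z}\}$ to denote the set of octagonal constraints, and $\mathbb{C}_P = \{\alpha_1 A_1 +
\ldots + \alpha_k A_k + \beta \geq 0 \mid A_1, \ldots, A_k \in \mathbb{F}, \alpha_1, \ldots, \alpha_k, \beta \in \mathbb{Z}, \gcd(|\alpha_1|, \ldots, |\alpha_k|, |\beta|) = 1\}$
to denote the set of polyhedral constraints. We have $\mathbb{C}_I \subseteq \mathbb{C}_O \subseteq \mathbb{C}_P$.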

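The set $\mathbb{C}_{\mathbb{D}}$ of linear constraints over features $\mathbb{F}$ is constructed by the
underlying numerical property domain $\langle \mathbb{D}, \sqsubseteq_{\mathbb{D}} \rangle$ using the Galois connection
$\langle \mathcal{P}(\mathbb{C}_{\mathbb{D}}), \sqsubseteq_{\mathbb{D}} \rangle \underset{\alpha_{\mathbb{C}_{\mathbb{D}}}}{\overset{\gamma_{\mathbb{C}_{\mathbb{D}}}}{\leftrightarrows}} \langle \mathbb{D}, \sqsubseteq_{\mathbb{D}} \rangle$, where $\mathcal{P}(\mathbb{C}_{\mathbb{D}})$ is the power set of $\mathbb{C}_{\mathbb{D}}$. The abstraction
function $\alpha_{\mathbb{C}_{\mathbb{D}}} : \mathcal{P}(\mathbb{C}_{\mathbb{D}}) \to \mathbb{D}$ maps a set of interval (resp., octagonal, polyhedral)
constraints to an interval (resp., an octagon, a polyhedron) that represents a
conjunction of constraints; the concretization function $\gamma_{\mathbb{C}_{\mathbb{D}}} : \mathbb{D} \to \mathcal{P}(\mathbb{C}_{\mathbb{D}})$ maps
an interval (resp., an octagon, a polyhedron) that represents a conjunction of
constraints to a set of interval (resp., octagonal, polyhedral) constraints. We have
$\gamma_{\mathbb{C}_{\mathbb{D}}}(\top_{\mathbb{D}}) = \emptyset$ and $\gamma_{\mathbb{C}_{\mathbb{D}}}(\bot_{\mathbb{D}}) = \{\bot_{\mathbb{C}_{\mathbb{D}}}\}$, where $\bot_{\mathbb{C}_{\mathbb{D}}}$ is an unsatisfiable constraint.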

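The domain of decision nodes is $\mathbb{C}_{\mathbb{D}}$. We assume $\mathbb{F} = \{A_1, \ldots, A_k\}$ to be a finite
and totally ordered set of features, such that the ordering is $A_1 > A_2 > \ldots > A_k$.
We impose a total order $<_{\mathbb{C}_{\mathbb{D}}}$ on $\mathbb{C}_{\mathbb{D}}$ to be the lexicographic order on the coefficients
$\alpha_1, \ldots, \alpha_k$ and constant $\alpha_{k+1}$ of the linear constraints, such that:

$$(\alpha_1 \cdot A_1 + \ldots + \alpha_k \cdot A_k + \alpha_{k+1} \geq 0) <_{\mathbb{C}_{\mathbb{D}}} (\alpha'_1 \cdot A_1 + \ldots + \alpha'_k \cdot A_k + \alpha'_{k+1} \geq 0)
\iff \exists j > 0.\ \forall i < j.\ (\alpha_i = \alpha'_i) \wedge (\alpha_j < \alpha'_j)$$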


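The negation of linear constraints is formed as: $\neg(\alpha_1 A_1 + \ldots + \alpha_k A_k + \beta \geq 0)
= -\alpha_1 A_1 - \ldots - \alpha_k A_k - \beta - 1 \geq 0$. For example, the negation of $A - 3 \geq 0$
is the constraint $-A + 2 \geq 0$ (i.e., $A \leq 2$). To ensure canonical representation
of decision trees, a linear constraint $c$ and its negation $\neg c$ cannot both appear
as nodes in a decision tree. For example, we only keep the largest constraint
with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ between $c$ and $\neg c$. For this reason, we define the equivalence
relation $\equiv_{\mathbb{C}_{\mathbb{D}}}$ as $c \equiv_{\mathbb{C}_{\mathbb{D}}} \neg c$. We define $\langle \mathbb{C}_{\mathbb{D}}, <_{\mathbb{C}_{\mathbb{D}}} \rangle$ to denote $\langle \mathbb{C}_{\mathbb{D}}/_{\equiv}, <_{\mathbb{C}_{\mathbb{D}}} \rangle$, such that
elements of $\mathbb{C}_{\mathbb{D}}$ are constraints obtained by quotienting by the equivalence $\equiv_{\mathbb{C}_{\mathbb{D}}}$.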
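The negation rule and the choice of a canonical representative can be sketched in a few lines, assuming a constraint is encoded as the integer tuple of its coefficients followed by its constant. With this encoding Python's built-in tuple comparison coincides with the lexicographic order described above. The names `negate`, `canonical`, and `holds` are hypothetical.

```python
def negate(c):
    """Negation over the integers: not(a.A + b >= 0) is (-a).A - b - 1 >= 0."""
    *alphas, beta = c
    return tuple(-a for a in alphas) + (-beta - 1,)


def canonical(c):
    """Of c and its negation, keep the larger one in lexicographic order."""
    return max(c, negate(c))


def holds(c, values):
    """Check whether the feature values satisfy constraint c."""
    *alphas, beta = c
    return sum(a * v for a, v in zip(alphas, values)) + beta >= 0
```

For the example in the text, `negate((1, -3))` gives `(-1, 2)`, i.e. −A + 2 ≥ 0, and `canonical((-1, 2))` returns `(1, -3)`, so only A − 3 ≥ 0 would label a decision node.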


Abstract domain for constraint-based decision trees. A constraint-based decision
tree $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$ over the sets $\mathbb{C}_{\mathbb{D}}$ of linear constraints defined over $\mathbb{F}$ and the
leaf abstract domain $\mathbb{A}$ defined over Var is either a leaf node $\ll a \gg$ with $a \in \mathbb{A}$,
or $[\![c : tl, tr]\!]$, where $c \in \mathbb{C}_{\mathbb{D}}$ (denoted by $t.c$) is the smallest constraint with
respect to $<_{\mathbb{C}_{\mathbb{D}}}$ appearing in the tree $t$, $tl$ (denoted by $t.l$) is the left subtree of
$t$ representing its true branch, and $tr$ (denoted by $t.r$) is the right subtree of $t$
representing its false branch. The path along a decision tree establishes the set
of configurations (those that satisfy the encountered constraints), and the leaf
nodes represent the analysis properties for the corresponding configurations.


Example 2. The following two constraint-based decision trees $t_1$ and $t_2$ have
decision nodes labelled with Interval linear constraints over the numeric feature
SIZE with domain $\{1,2,3,4\}$, whereas leaf nodes are Interval properties:

$$t_1 = [\![\mathrm{SIZE} \geq 4 : \ll [y \geq 2] \gg, \ll [y = 0] \gg]\!], \qquad t_2 = [\![\mathrm{SIZE} \geq 2 : \ll [y \geq 0] \gg, \ll [y \leq 0] \gg]\!]$$


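Abstract Operations. The concretization function $\gamma_{\mathbb{T}}$ of a decision tree $t \in
\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$ returns $\gamma_{\mathbb{A}}(a)$ for $k \in \mathbb{K}$, where $k$ satisfies the set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}})$ of
constraints accumulated along the top-down path to the leaf node $a \in \mathbb{A}$. More
formally, $\gamma_{\mathbb{T}}(t) = \bar\gamma_{\mathbb{T}}[\mathbb{K}](t)$. The function $\bar\gamma_{\mathbb{T}}$ accumulates into a set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}})$
constraints along the paths up to a leaf node, which is initially equal to the set of
implicit constraints over $\mathbb{F}$, $\mathbb{K} = \bigvee_{k \in \mathbb{K}} k$, taking into account domains of features:

$$\bar\gamma_{\mathbb{T}}[C](\ll a \gg) = \prod_{k \models C} \gamma_{\mathbb{A}}(a), \qquad \bar\gamma_{\mathbb{T}}[C]([\![c : tl, tr]\!]) = \bar\gamma_{\mathbb{T}}[C \cup \{c\}](tl) \times \bar\gamma_{\mathbb{T}}[C \cup \{\neg c\}](tr)$$

Note that $k \models C$ is equivalent with $\alpha_{\mathbb{C}_{\mathbb{D}}}(\{k\}) \sqsubseteq_{\mathbb{D}} \alpha_{\mathbb{C}_{\mathbb{D}}}(C)$. Therefore, we can
check $k \models C$ using the abstract operation $\sqsubseteq_{\mathbb{D}}$ of the numerical domain $\mathbb{D}$.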

Other binary operations of $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$ are based on Algorithm 1 for tree unification,
which finds a common refinement (labelling) of two trees $t_1$ and $t_2$ by calling
function UNIFICATION($t_1$, $t_2$, $\mathbb{K}$). It possibly adds new constraints as decision
nodes (Lines 5-7, Lines 11-13), or removes constraints that are redundant (Lines
3,4,9,10,15,16). The function UNIFICATION accumulates into the set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}})$
(initialized to $\mathbb{K}$, which represents implicit constraints satisfied by both $t_1$ and $t_2$),
constraints encountered along the paths of the decision tree. This set $C$ is used
by the function isRedundant($c$, $C$), which checks whether the linear constraint
$c \in \mathbb{C}_{\mathbb{D}}$ is redundant with respect to $C$ by testing $\alpha_{\mathbb{C}_{\mathbb{D}}}(C) \sqsubseteq_{\mathbb{D}} \alpha_{\mathbb{C}_{\mathbb{D}}}(\{c\})$. Note that
the tree unification does not lose any information.


Example 3. Consider constraint-based decision trees $t_1$ and $t_2$ from Example 2.
After tree unification UNIFICATION($t_1$, $t_2$, $\mathbb{K}$), the resulting decision trees are:

$$t_1 = [\![\mathrm{SIZE} \geq 4 : \ll [y \geq 2] \gg, [\![\mathrm{SIZE} \geq 2 : \ll [y = 0] \gg, \ll [y = 0] \gg]\!]]\!],$$
$$t_2 = [\![\mathrm{SIZE} \geq 4 : \ll [y \geq 0] \gg, [\![\mathrm{SIZE} \geq 2 : \ll [y \geq 0] \gg, \ll [y \leq 0] \gg]\!]]\!]$$

Note that UNIFICATION adds a decision node for $\mathrm{SIZE} \geq 2$ to the right subtree of
$t_1$, whereas it adds a decision node for $\mathrm{SIZE} \geq 4$ to $t_2$ and removes the redundant
constraint $\mathrm{SIZE} \geq 2$ from the resulting left subtree of $t_2$.
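The unification of Example 3 can be replayed with a sketch of Algorithm 1 under two simplifying assumptions: constraints range over the single feature SIZE and are encoded as pairs (a, b) meaning a·SIZE + b ≥ 0, and the accumulated constraint set C is represented concretely as the set of SIZE values still possible, rather than as an element of a numerical abstract domain. The names `Leaf`, `Node`, `unify`, `restrict`, and `is_redundant` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

INF = float("inf")
Constraint = Tuple[int, int]  # (a, b) encodes a*SIZE + b >= 0


@dataclass(frozen=True)
class Leaf:
    prop: Tuple[float, float]  # interval lo <= y <= hi


@dataclass(frozen=True)
class Node:
    c: Constraint
    l: "Tree"  # true branch
    r: "Tree"  # false branch


Tree = Union[Leaf, Node]


def holds(c, size):
    return c[0] * size + c[1] >= 0


def negate(c):
    return (-c[0], -c[1] - 1)


def restrict(C, c):
    """Concrete counterpart of C united with {c}: values of C that satisfy c."""
    return frozenset(k for k in C if holds(c, k))


def is_redundant(c, C):
    """c is redundant with respect to C if every remaining value satisfies it."""
    return all(holds(c, k) for k in C)


def split(c, t1l, t1r, t2l, t2r, C):
    """Shared body of the three cases of Algorithm 1 for decision constraint c."""
    if is_redundant(c, C):
        return unify(t1l, t2l, C)
    if is_redundant(negate(c), C):
        return unify(t1r, t2r, C)
    l1, l2 = unify(t1l, t2l, restrict(C, c))
    r1, r2 = unify(t1r, t2r, restrict(C, negate(c)))
    return Node(c, l1, r1), Node(c, l2, r2)


def unify(t1, t2, C):
    if isinstance(t1, Leaf) and isinstance(t2, Leaf):
        return t1, t2
    both = isinstance(t1, Node) and isinstance(t2, Node)
    if isinstance(t1, Leaf) or (both and t2.c < t1.c):
        return split(t2.c, t1, t1, t2.l, t2.r, C)       # t2's constraint comes first
    if isinstance(t2, Leaf) or (both and t1.c < t2.c):
        return split(t1.c, t1.l, t1.r, t2, t2, C)       # t1's constraint comes first
    return split(t1.c, t1.l, t1.r, t2.l, t2.r, C)       # same constraint in both


T1 = Node((1, -4), Leaf((2, INF)), Leaf((0, 0)))        # SIZE >= 4
T2 = Node((1, -2), Leaf((0, INF)), Leaf((-INF, 0)))     # SIZE >= 2
```

Calling `unify(T1, T2, frozenset({1, 2, 3, 4}))` yields two trees with the same shape: SIZE ≥ 4 at the root and SIZE ≥ 2 in the false branch, as in Example 3. The helper `split` merely factors out the three structurally identical branches of the algorithm.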


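All binary operations are performed leaf-wise on the unified decision trees.
Given two unified decision trees $t_1$ and $t_2$, their ordering and join are defined as:

$$\begin{aligned}
\ll a_1 \gg \sqsubseteq_{\mathbb{T}} \ll a_2 \gg &= a_1 \sqsubseteq_{\mathbb{A}} a_2, &
[\![c : tl_1, tr_1]\!] \sqsubseteq_{\mathbb{T}} [\![c : tl_2, tr_2]\!] &= (tl_1 \sqsubseteq_{\mathbb{T}} tl_2) \wedge (tr_1 \sqsubseteq_{\mathbb{T}} tr_2) \\
\ll a_1 \gg \sqcup_{\mathbb{T}} \ll a_2 \gg &= \ll a_1 \sqcup_{\mathbb{A}} a_2 \gg, &
[\![c : tl_1, tr_1]\!] \sqcup_{\mathbb{T}} [\![c : tl_2, tr_2]\!] &= [\![c : tl_1 \sqcup_{\mathbb{T}} tl_2, tr_1 \sqcup_{\mathbb{T}} tr_2]\!]
\end{aligned}$$

Similarly, we compute meet, widening, and narrowing of $t_1$ and $t_2$. The top is a
tree with a single $\top_{\mathbb{A}}$ leaf: $\top_{\mathbb{T}} = \ll \top_{\mathbb{A}} \gg$, while the bottom is: $\bot_{\mathbb{T}} = \ll \bot_{\mathbb{A}} \gg$.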


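Example 4. Consider the unified trees $t_1$ and $t_2$ from Example 3. We have that
$t_1 \sqsubseteq_{\mathbb{T}} t_2$ holds, and $t_1 \sqcup_{\mathbb{T}} t_2 = [\![\mathrm{SIZE} \geq 4 : \ll [y \geq 0] \gg, [\![\mathrm{SIZE} \geq 2 : \ll [y \geq 0] \gg, \ll [y \leq 0] \gg]\!]]\!]$.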
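The leaf-wise ordering and join can be checked on the unified trees of Example 3 with the following sketch. Trees are encoded as nested tuples, leaf properties are non-empty intervals (lo, hi) for y, and the helper names `leaf`, `node`, `leq`, and `join` are hypothetical; both functions assume their arguments have already been unified, so the two trees have identical shape.

```python
INF = float("inf")


def leaf(lo, hi):
    return ("leaf", (lo, hi))


def node(c, tl, tr):
    return ("node", c, tl, tr)


def leq(t1, t2):
    """Ordering on unified trees: interval inclusion at corresponding leaves."""
    if t1[0] == "leaf":
        (lo1, hi1), (lo2, hi2) = t1[1], t2[1]
        return lo2 <= lo1 and hi1 <= hi2
    return leq(t1[2], t2[2]) and leq(t1[3], t2[3])


def join(t1, t2):
    """Join on unified trees: interval hull at corresponding leaves."""
    if t1[0] == "leaf":
        (lo1, hi1), (lo2, hi2) = t1[1], t2[1]
        return leaf(min(lo1, lo2), max(hi1, hi2))
    return node(t1[1], join(t1[2], t2[2]), join(t1[3], t2[3]))


# Unified trees of Example 3.
U1 = node("SIZE >= 4", leaf(2, INF),
          node("SIZE >= 2", leaf(0, 0), leaf(0, 0)))
U2 = node("SIZE >= 4", leaf(0, INF),
          node("SIZE >= 2", leaf(0, INF), leaf(-INF, 0)))
```

Here `leq(U1, U2)` holds and `join(U1, U2)` equals `U2`, in agreement with Example 4.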


Algorithm 1: UNIFICATION(t1, t2, C)

 1 if isLeaf(t1) ∧ isLeaf(t2) then return (t1, t2);
 2 if isLeaf(t1) ∨ (isNode(t1) ∧ isNode(t2) ∧ t2.c <_C t1.c) then
 3     if isRedundant(t2.c, C) then return UNIFICATION(t1, t2.l, C);
 4     if isRedundant(¬t2.c, C) then return UNIFICATION(t1, t2.r, C);
 5     (l1, l2) = UNIFICATION(t1, t2.l, C ∪ {t2.c});
 6     (r1, r2) = UNIFICATION(t1, t2.r, C ∪ {¬t2.c});
 7     return ([[t2.c : l1, r1]], [[t2.c : l2, r2]]);
 8 if isLeaf(t2) ∨ (isNode(t1) ∧ isNode(t2) ∧ t1.c <_C t2.c) then
 9     if isRedundant(t1.c, C) then return UNIFICATION(t1.l, t2, C);
10     if isRedundant(¬t1.c, C) then return UNIFICATION(t1.r, t2, C);
11     (l1, l2) = UNIFICATION(t1.l, t2, C ∪ {t1.c});
12     (r1, r2) = UNIFICATION(t1.r, t2, C ∪ {¬t1.c});
13     return ([[t1.c : l1, r1]], [[t1.c : l2, r2]]);
14 else
15     if isRedundant(t1.c, C) then return UNIFICATION(t1.l, t2.l, C);
16     if isRedundant(¬t1.c, C) then return UNIFICATION(t1.r, t2.r, C);
17     (l1, l2) = UNIFICATION(t1.l, t2.l, C ∪ {t1.c});
18     (r1, r2) = UNIFICATION(t1.r, t2.r, C ∪ {¬t1.c});
19     return ([[t1.c : l1, r1]], [[t1.c : l2, r2]]);


Algorithm 2: ASSIGN_T(t, x:=e)

1 if isLeaf(t) then return ASSIGN_A(t, x:=e);
2 return [[t.c : ASSIGN_T(t.l, x:=e), ASSIGN_T(t.r, x:=e)]];


Transfer functions. The transfer functions for forward assignments ($\mathrm{ASSIGN}_{\mathbb{T}}$)
and expression-based tests ($\mathrm{FILTER}_{\mathbb{T}}$) modify only leaf nodes of a constraint-based
decision tree. In contrast, transfer functions for variability-specific constructs,
such as feature-based tests ($\mathrm{FEAT\text{-}FILTER}_{\mathbb{T}}$) and #if-s ($\mathrm{IFDEF}_{\mathbb{T}}$) add,
modify, or delete decision nodes of a decision tree. This is due to the fact that
the analysis information about program variables is located in leaf nodes, while
the information about feature variables is located in decision nodes.

Transfer function $\mathrm{ASSIGN}_{\mathbb{T}}$ for handling an assignment $x{:=}e$ in the input tree
$t$ is described by Algorithm 2. Note that $x \in \mathrm{Var}$, and $e \in \mathrm{Exp}$ may contain only
program variables. We apply $\mathrm{ASSIGN}_{\mathbb{A}}$ to each leaf node $a$ of $t$, which substitutes
expression $e$ for variable $x$ in $a$. Similarly, transfer function $\mathrm{FILTER}_{\mathbb{T}}$ for handling
expression-based tests $e \in \mathrm{Exp}$ is implemented by applying $\mathrm{FILTER}_{\mathbb{A}}$ leaf-wise.
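The leaf-wise scheme of Algorithm 2 can be sketched as a generic traversal that applies a leaf transformer and leaves every decision node untouched. The nested-tuple tree encoding, the interval leaf domain for a single variable y, and the names `assign_leafwise` and `add_const` are assumptions for illustration; `add_const(n)` plays the role of the leaf-domain assignment for y := y + n.

```python
def assign_leafwise(t, leaf_transformer):
    """Apply a leaf-domain transfer function to every leaf of tree t."""
    if t[0] == "leaf":
        return ("leaf", leaf_transformer(t[1]))
    _, c, tl, tr = t
    # Decision nodes only carry feature constraints, so they are kept as they are.
    return ("node", c,
            assign_leafwise(tl, leaf_transformer),
            assign_leafwise(tr, leaf_transformer))


def add_const(n):
    """Interval transformer for y := y + n: shift both bounds by n."""
    def transformer(interval):
        lo, hi = interval
        return (lo + n, hi + n)
    return transformer
```

The same traversal would serve for expression-based tests by passing a leaf-level filter instead of an assignment transformer.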

Algorithm 3: FEAT-FILTER_T(t, θ)

1 switch θ do
2     case (e_F ⋈ e_F) || (¬(e_F ⋈ e_F)) do
3         J = FILTER_D(⊤_D, θ); return RESTRICT(t, K, J)
4     case θ1 ∧ θ2 do
5         return FEAT-FILTER_T(t, θ1) ⊓_T FEAT-FILTER_T(t, θ2)
6     case θ1 ∨ θ2 do
7         return FEAT-FILTER_T(t, θ1) ⊔_T FEAT-FILTER_T(t, θ2)

Transfer function $\mathrm{FEAT\text{-}FILTER}_{\mathbb{T}}$ for feature-based tests $\theta$ is described by
Algorithm 3. It reasons by induction on the structure of $\theta$ (we assume negation is
applied to atomic propositions). When $\theta$ is an atomic constraint over numerical
features (Lines 2,3), we use $\mathrm{FILTER}_{\mathbb{D}}$ to approximate $\theta$, thus producing a set of
constraints $J$, which are then added to the tree $t$, possibly discarding all paths of
$t$ that do not satisfy $\theta$. This is done by calling function RESTRICT($t$, $\mathbb{K}$, $J$), which
adds linear constraints from $J$ to $t$ in ascending order with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ as
shown in Algorithm 4. Note that $\theta$ may not be representable exactly in $\mathbb{C}_{\mathbb{D}}$ (e.g.,
in the case of non-linear constraints over $\mathbb{F}$), so $\mathrm{FILTER}_{\mathbb{D}}$ may produce a set of
constraints approximating it. When $\theta$ is a conjunction (resp., disjunction) of two
feature expressions (Lines 4,5) (resp., (Lines 6,7)), the resulting decision trees
are merged by operation meet $\sqcap_{\mathbb{T}}$ (resp., join $\sqcup_{\mathbb{T}}$). Function RESTRICT($t$, $C$, $J$),
described in Algorithm 4, takes as input a decision tree $t$, a set $C$ of linear
constraints accumulated along paths up to a node, and a set $J$ of linear constraints
in canonical form that need to be added to $t$. For each constraint $j \in J$, there
exists a boolean $b_j$ that shows whether the tree should be constrained with
respect to $j$ or with respect to $\neg j$. When $J$ is not empty, the linear constraints
from $J$ are added to $t$ in ascending order with respect to $<_{\mathbb{C}_{\mathbb{D}}}$. At each iteration,
the smallest linear constraint $j$ is extracted from $J$ (Line 9), and is handled
appropriately based on whether $j$ is smaller than (Lines 11-15), or greater than or
equal to (Lines 17-21), the constraint at the node of $t$ we currently consider.
Finally, transfer function IFDEF_T is defined as:

  IFDEF_T(t, #if (θ) s) = [[s]]_T(FEAT-FILTER_T(t, θ)) ⊔_T FEAT-FILTER_T(t, ¬θ)

where [[s]]_T(t) denotes the transfer function in T(C_D, A) for statement s.
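To make the structure of FEAT-FILTER and IFDEF concrete, the following sketch replays both definitions on an explicit, finite configuration set instead of decision trees. It is only an illustration under stated assumptions: a lifted element is a dictionary from configurations to intervals of x, None stands for bottom, feature expressions are small tuples ("atom", predicate), ("and", θ1, θ2), ("or", θ1, θ2), and all function names are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Interval = Tuple[int, int]
# Lifted element: configuration -> interval of x; None stands for bottom.
Lifted = Dict[int, Optional[Interval]]


def join_iv(p: Optional[Interval], q: Optional[Interval]) -> Optional[Interval]:
    if p is None:
        return q
    if q is None:
        return p
    return (min(p[0], q[0]), max(p[1], q[1]))


def meet_iv(p: Optional[Interval], q: Optional[Interval]) -> Optional[Interval]:
    if p is None or q is None:
        return None
    lo, hi = max(p[0], q[0]), min(p[1], q[1])
    return (lo, hi) if lo <= hi else None


def pointwise(op, a: Lifted, b: Lifted) -> Lifted:
    return {k: op(a[k], b[k]) for k in a}


def feat_filter(a: Lifted, theta) -> Lifted:
    """Structural recursion on theta: atoms restrict, 'and' meets, 'or' joins."""
    kind = theta[0]
    if kind == "atom":
        return {k: (v if theta[1](k) else None) for k, v in a.items()}
    if kind == "and":
        return pointwise(meet_iv, feat_filter(a, theta[1]), feat_filter(a, theta[2]))
    if kind == "or":
        return pointwise(join_iv, feat_filter(a, theta[1]), feat_filter(a, theta[2]))
    raise ValueError(kind)


def negate(theta):
    """Push negation down to atoms, as assumed for feature expressions."""
    kind = theta[0]
    if kind == "atom":
        return ("atom", lambda k, p=theta[1]: not p(k))
    if kind == "and":
        return ("or", negate(theta[1]), negate(theta[2]))
    return ("and", negate(theta[1]), negate(theta[2]))


def ifdef(a: Lifted, theta, stmt: Callable[[Interval], Interval]) -> Lifted:
    """IFDEF: run s on configurations satisfying theta, keep the others unchanged."""
    filtered = feat_filter(a, theta)
    taken = {k: (None if v is None else stmt(v)) for k, v in filtered.items()}
    skipped = feat_filter(a, negate(theta))
    return pointwise(join_iv, taken, skipped)


# #if (SIZE==3 || SIZE==4) x := x-2; #endif, with SIZE sampled from 0..5
theta = ("or", ("atom", lambda k: k == 3), ("atom", lambda k: k == 4))
start: Lifted = {k: (0, 0) for k in range(6)}
result = ifdef(start, theta, lambda iv: (iv[0] - 2, iv[1] - 2))
```

In the decision tree domain, the same recursion is performed symbolically: RESTRICT plays the role of the atomic case, and the meet and join operate on whole trees rather than component by component.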

After applying transfer functions, the obtained decision trees may contain
some redundancy that can be exploited to further compress them. Function
COMPRESS_T(t, C), described by Algorithm 5, is applied to decision trees t in order
to compress (reduce) their representation. We use five different optimizations.
First, if the constraints on a path to some leaf are unsatisfiable, we eliminate that
leaf node (Lines 9,10). Second, if a decision node contains two identical subtrees,
then we keep only one subtree and we also eliminate the decision node (Lines
11-13). Third, if a decision node contains a left leaf and a right subtree, such that
its left leaf is the same as the left leaf of its right subtree and the constraint in
the decision node is less than or equal to the constraint in the root of its right
subtree, then we can eliminate the decision node and its left leaf (Lines 14,15).
A similar rule exists when a decision node has a left subtree and a right leaf
(Lines 16,17).
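Two of these optimizations can be illustrated with a deliberately simplified sketch. It assumes a single integer feature F, decision nodes of the form F <= bound, and path constraints summarised as a range lo..hi of F, so unsatisfiability is checked by comparing bounds rather than by a numerical domain. It is not Algorithm 5 itself; the names Leaf, Node and compress are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: str            # stands in for a numerical property of the program variables


@dataclass(frozen=True)
class Node:
    bound: int            # decision node constraint: F <= bound
    left: "Tree"          # taken when F <= bound
    right: "Tree"         # taken when F > bound


Tree = Union[Leaf, Node]


def compress(t: Tree, lo: int, hi: int) -> Tree:
    """Simplify t, given that F lies in lo..hi on the path leading to t."""
    if isinstance(t, Leaf):
        return t
    if hi <= t.bound:
        # F <= bound always holds here, so the right branch is unreachable.
        return compress(t.left, lo, hi)
    if lo > t.bound:
        # F <= bound never holds here, so the left branch is unreachable.
        return compress(t.right, lo, hi)
    left = compress(t.left, lo, t.bound)
    right = compress(t.right, t.bound + 1, hi)
    if left == right:
        # Identical subtrees: the decision node carries no information.
        return left
    return Node(t.bound, left, right)


tree = Node(3, Node(5, Leaf("a"), Leaf("b")), Node(7, Leaf("c"), Leaf("c")))
compressed = compress(tree, 0, 10)
```

Here the inner node F <= 5 is redundant under F <= 3, and the node F <= 7 has two identical leaves, so only the root decision survives.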


Lifted analysis. The abstract operations and transfer functions of T(C_D, A) can
be used to define the lifted analysis for program families. The tree t_in at the
initial location has only one leaf node ⊤_A, and decision nodes that define the
set K. Note that if K = true, then t_in = ⊤_T. In this way, we collect the possible
invariants in the form of decision trees at all program locations.

Algorithm 4: RESTRICT(t, C, J)
1  if isEmpty(J) then
2    if isLeaf(t) then return t;
3    if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
4    if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
5    l = RESTRICT(t.l, C ∪ {t.c}, J);
6    r = RESTRICT(t.r, C ∪ {¬t.c}, J);
7    return [[t.c : l, r]];
8  else
9    j = min_{<C_D}(J);
10   if isLeaf(t) ∨ (isNode(t) ∧ j <_{C_D} t.c) then
11     if isRedundant(j, C) then return RESTRICT(t, C, J\{j});
12     if isRedundant(¬j, C) then return ≪⊥_A≫;
13     if j =_{C_D} t.c then (if b_j then t = t.l; else t = t.r);
14     if b_j then return [[j : RESTRICT(t, C ∪ {j}, J\{j}), ≪⊥_A≫]];
15     else return [[j : ≪⊥_A≫, RESTRICT(t, C ∪ {¬j}, J\{j})]];
16   else
17     if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
18     if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
19     l = RESTRICT(t.l, C ∪ {t.c}, J);
20     r = RESTRICT(t.r, C ∪ {¬t.c}, J);
21     return [[t.c : l, r]];

We establish correctness of the lifted analysis based on T(C_D, A) by showing
that it produces results identical to those of the tuple-based domain A^K. Let
[[s]]_T and [[s]] denote the transfer functions of statement s in T(C_D, A) and A^K,
respectively. Recall that a_in = ∏_{k∈K} ⊤_A, and so γ_T(t_in) = γ(a_in).

Theorem 1. γ_T([[s]]_T(t_in)) = γ([[s]](a_in)).

Proof. The proof is by induction on the structure of s. We consider the most
interesting case: #if (θ) s #endif. Transfer functions for #if are identical in
both lifted domains. We only need to show that FEAT-FILTER(a, θ) and
FEAT-FILTER_T(t, θ) are identical. This is shown by induction on θ [13].


Example 5. Let us consider the code base of a program family P given in Fig. 4.
It contains only one numerical feature SIZE with domain N. The decision tree
inferred at the final location ④ is depicted in Fig. 5. It uses the Interval domain
for both decision and leaf nodes. Note that the constraint (SIZE < 3) does
not explicitly appear in the code base, but we obtain it in the decision tree
representation. This shows that the partitioning of the configuration space K
induced by decision trees is semantics-based rather than syntax-based.

Algorithm 5: COMPRESS_T(t, C)
1  switch t do
2    case ≪n≫ do
3      return ≪n≫;
4    case [[t.c : l, r]] do
5      l' = COMPRESS_T(t.l, C ∪ {t.c});
6      r' = COMPRESS_T(t.r, C ∪ {¬t.c});
7      switch l', r' do
8        case ≪n1≫, ≪n2≫ do
9          if UNSAT(C ∪ {t.c}) then return ≪n2≫;
10         if UNSAT(C ∪ {¬t.c}) then return ≪n1≫;
11         if n1 = n2 then return ≪n1≫;
12       case [[c1 : l1, r1]], [[c2 : l2, r2]] when c1 = c2 ∧ l1 = l2 ∧ r1 = r2 do
13         return [[c1 : l1, r1]];
14       case ≪n1≫, [[c2 : l2, r2]] when ≪n1≫ = l2 ∧ t.c ≤_{C_D} c2 do
15         return [[c2 : l2, r2]];
16       case [[c1 : l1, r1]], ≪n2≫ when ≪n2≫ = r1 ∧ c1 ≤_{C_D} t.c do
17         return [[c1 : l1, r1]];
18       case default: do
19         return [[t.c : l', r']];

① int x := 0;
② #if (SIZE < 4) x := x+1; #else x := x-1; #endif
③ #if (SIZE==3 || SIZE==4) x := x-2; #endif ④

Fig. 4: Code base for program family P.

[Figure omitted: decision tree whose root decision node is (SIZE < 3).]

Fig. 5: Decision tree at loc. ④ of P.
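The semantics-based partitioning of Example 5 can be checked by brute force. The sketch below runs every variant of the code base from Fig. 4 concretely and groups consecutive values of SIZE that lead to the same final value of x. It assumes that sampling SIZE from 0..7 is representative of the domain N; the function names are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def run_variant(size: int) -> int:
    """Concrete execution of the variant of P selected by SIZE = size."""
    x = 0
    if size < 4:              # #if (SIZE < 4) x := x+1; #else x := x-1; #endif
        x = x + 1
    else:
        x = x - 1
    if size == 3 or size == 4:  # #if (SIZE==3 || SIZE==4) x := x-2; #endif
        x = x - 2
    return x


def partition(sizes: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Blocks (first SIZE, last SIZE, final x) of consecutive equal results."""
    results = [(s, run_variant(s)) for s in sizes]
    blocks = []
    for x, group in groupby(results, key=lambda pair: pair[1]):
        values = [s for s, _ in group]
        blocks.append((values[0], values[-1], x))
    return blocks


blocks = partition(range(8))
```

The first block covers SIZE = 0, 1, 2, that is, exactly the configurations satisfying SIZE < 3, although this constraint is not written anywhere in the code base.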


Example 6. Let us consider the code base of a program family P' given in Fig. 6.
It contains one numerical feature A with domain [1,4] and a non-linear feature
expression A*A < 9. At program location ②, FEAT-FILTER_T(≪x = 0≫, A*A < 9)
returns an over-approximating tree ≪x = 0≫, whereas
FEAT-FILTER_T(≪x = 0≫, ¬(A*A < 9)) returns [[A ≥ 3 : ≪x = 0≫, ≪⊥_A≫]]. In effect,
we obtain an over-approximating result at the final program location ③ as shown
in Fig. 7. The precise result at the program location ③, which can be obtained in
case we have numerical domains that can handle non-linear constraints, is given in
Fig. 8. We observe that when (A < 2), we obtain an over-approximating analysis
result (−1 ≤ x ≤ 1 instead of x = −1) due to the over-approximation of the
non-linear feature expression in the numerical domains we use.

① int x := 0;
② #if (A*A < 9) x := x+1;
  #else x := x-1; #endif ③

Fig. 6: Code base for P'.

[Figure omitted: decision tree with leaf nodes [−1 ≤ x ≤ 1] and [x = 1].]

Fig. 7: Over-approximating decision tree at loc. ③ of P'.

[Figure omitted: decision tree with leaf nodes [x = 1] and [x = −1].]

Fig. 8: Precise decision tree at loc. ③ of P'.


6 Evaluation 


Implementation. We have developed a prototype lifted static analyzer, called
SPLNum²Analyzer, that uses the lifted abstract domains of tuples A^K and
decision trees T(C_D, A). The abstract domains A for encoding properties of tuple
components and leaf nodes as well as the abstract domain D for encoding linear
constraints over numerical features are based on the intervals, octagons, and
polyhedra domains. Their abstract operations and transfer functions are provided
by the APRON library [19]. Our proof-of-concept implementation is written
in OCaml and consists of around 6K lines of code. The current front-end of
the tool accepts programs written in a subset of C with #if directives, but
without struct and union types. It currently provides only limited support
for arrays, pointers, and recursion. The only basic data type is mathematical
integers. SPLNum²Analyzer automatically infers numerical invariants in all
program locations corresponding to all variants in the given family. We use
delayed widening and narrowing [7,24] to improve the precision of while-s.


Experimental setup and Benchmarks. All experiments are executed on a 64-bit
Intel Core i7-8700 CPU @ 3.20GHz × 12, Ubuntu 18.04.5 LTS, with 8 GB
memory, and we use a timeout value of 300 sec. All times are reported as the
average over five independent executions. The implementation, benchmarks, and all
results obtained from our experiments are available from:
https://github.com/aleksdimovski/SPLNUM2Analyzer. In our experiments, we use
three instances of our lifted analysis via tuples: A_Π(I), A_Π(O), and A_Π(P),
and via decision trees: A_T(I), A_T(O), and A_T(P), which use the intervals,
octagons, and polyhedra domains as parameters, respectively.

SPLNum²Analyzer was evaluated on a dozen C programs collected from
several categories of the 8th International Competition on Software Verification
(SV-COMP 2019, https://sv-comp.sosy-lab.org/2019/): loops, loop-invgen
(invgen for short), loop-lit (lit), termination-crafted (crafted); as well
as from the real-world BusyBox project (https://busybox.net). In the case of
SV-COMP, we have first selected some numerical programs with integers, and
then we have manually added variability (features and #if directives) in each
of them. In the case of BusyBox, we have first selected some programs with
numerical features, and then we have simplified those programs so that our tool
can handle them. For example, any reference to a pointer or a library function is
replaced with [−∞, +∞]. Table 1 presents characteristics of the benchmarks. We
list: the file name (Benchmark), the category (folder), the number of features
and configurations (|F|, |K|), and lines of code (LOC).

Table 1: Performance results for lifted static analyses based on decision trees vs.
tuples (which are used as baseline). All times are in seconds.

                                          A_T(I)          A_T(O)          A_T(P)
Benchmark     folder    |F|  |K|  LOC   TIME   IMPR.    TIME   IMPR.    TIME   IMPR.
hhk2008.c     lit        3   216   30   0.023  10x      0.153  4.5x     0.074  12.5x
gsv2008.c     lit        2    25   25   0.013  1.5x     0.035  1.2x     0.037  2x
gcnr2008.c    lit        2    25   30   0.021  2x       0.070  2.1x     0.102  2.6x
Toulouse*.c   crafted    3   125   75   0.043  6.1x     0.259  2.4x     0.175  7.6x
Mysore.c      crafted    3   125   35   0.019  3.7x     0.090  1.1x     0.056  5.4x
copyfd.c      BusyBox    1    16   84   0.013  3.9x     0.041  6.2x     0.054  5.2x
real_path.c   BusyBox    2   128   45   0.023  14x      0.077  28x      0.085  32x
half_2.c      invgen     2    36   60   0.010  2.4x     0.017  3.5x     0.022  4.6x
heapsort.c    invgen     2    36   60   0.036  2.2x     0.226  11x      0.191  2.0x
seq.c         invgen     3   125   40   0.039  9.3x     0.460  4.3x     0.164  11x
eq1.c         loops      2    36   20   0.015  3.4x     0.049  3.1x     0.052  4x
eq2.c         loops      2    25   20   0.013  1.9x     0.047  1.3x     0.040  1.9x
sum01*.c      loops      2    25   20   0.016  1.7x     0.086  1.5x     0.062  2.2x


Performance Results. Table 1 shows the results of analyzing our benchmark files
by using different versions of our lifted static analyses based on decision trees
and on tuples. For each version of decision tree-based lifted analysis, there are
two columns. In the first column, TIME, we report the running time in seconds
to analyze the given benchmark using the corresponding version of lifted analysis
based on decision trees. In the second column, IMPR., we report the speed-up
factor for each version of lifted analysis based on decision trees relative to the
corresponding baseline lifted analysis based on tuples (A_T(I) vs. A_Π(I), A_T(O)
vs. A_Π(O), and A_T(P) vs. A_Π(P)). The performance results confirm that sharing
is indeed effective and especially so for large values of |K|. On our benchmarks,
it translates to speed-ups (i.e., A_T(−) vs. A_Π(−)) that range from 1.1 to 4.6
times when |K| < 100, and from 3.7 to 32 times when |K| > 100.


Computational tractability. The tuple-based lifted analysis A_Π(−) may become
very slow or even infeasible for very large configuration spaces K. We have tested
the limits of A_Π(P) and A_T(P). We took a method, test_n^k(), which contains n
numerical features A_1, ..., A_n, such that each numerical feature A_i has domain
dom(A_i) = [0, k − 1] = {0, ..., k − 1}. The body of test_n^k() consists of n
sequentially composed #if-s of the form #if (A_i = 0) i := i+1 #else i := 0 #endif.
For example, test_2^3() with two features A_1 and A_2, whose domain is [0, 2], is:

① int i := 0;
② #if (A_1 = 0) i := i+1 #else i := 0 #endif
③ #if (A_2 = 0) i := i+1 #else i := 0 #endif ④

( A_1=0 ∧ A_2=0: [i=2],  A_1=0 ∧ A_2=1: [i=0],  A_1=0 ∧ A_2=2: [i=0],
  A_1=1 ∧ A_2=0: [i=1],  A_1=1 ∧ A_2=1: [i=0],  A_1=1 ∧ A_2=2: [i=0],
  A_1=2 ∧ A_2=0: [i=1],  A_1=2 ∧ A_2=1: [i=0],  A_1=2 ∧ A_2=2: [i=0] )

Fig. 9: A_Π(P) results at ④ of test_2^3().

[Figure omitted: decision tree with three leaf nodes.]

Fig. 10: A_T(P) results at ④ of test_2^3().


Subject to the chosen configuration, the variable i at location ④ can have a
value ranging from 2, when A_1 and A_2 are both assigned 0, to 0, when
A_2 ≥ 1. The analysis results at location ④ of test_2^3() obtained using A_Π(P)
and A_T(P) are shown in Fig. 9 and Fig. 10, respectively. A_Π(P) uses tuples with 9
interval properties (components), while A_T(P) uses 3 interval properties (leaves).


Table 2: The performance results of analyzing test_n^k.

              k = 3                       k = 5                       k = 7
 n    A_Π(P)      A_T(P)   IMPR.   A_Π(P)      A_T(P)   IMPR.   A_Π(P)      A_T(P)   IMPR.
 5    0.164       0.137    1.2x    2.859       0.139    20.6x   19.976      0.138    144.7x
 6    0.701       0.293    2.4x    23.224      0.294    79.1x   infeasible  0.299    ∞
 8    17.420      1.761    9.9x    infeasible  1.765    ∞       infeasible  1.767    ∞
10    278.7       5.591    49.8x   infeasible  5.596    ∞       infeasible  5.639    ∞
11    infeasible  13.807   ∞       infeasible  13.859   ∞       infeasible  13.809   ∞
14    infeasible  327.10   ∞       infeasible  442.23   ∞       infeasible  499.19   ∞


We have generated methods test_n^k() by gradually increasing variability. In
general, the size of tuples used by A_Π(P) is k^n, whereas the number of leaf
nodes in decision trees used by A_T(P) in the final program location is n + 1.
The performance results of analyzing test_n^k, for different values of n and k,
using A_Π(P) and A_T(P) are shown in Table 2. In the columns IMPR., we report
the speed-up of A_T(P) with respect to A_Π(P). We observe that A_T(P) yields
decision trees that provide a quite compact and symbolic representation of lifted
analysis results. Since the configurations with equivalent analysis results are
nicely encoded using linear constraints in decision nodes, the performance of
A_T(P) does not depend on k, but only depends on n. On the other hand, the
performance of A_Π(P) heavily depends on k. Thus, within a timeout limit of 300
seconds, the analysis A_Π(P) fails to terminate for test_11^3, test_8^5, and
test_6^7. In summary, we can conclude that decision trees A_T(P) can not only
greatly speed up lifted analyses, but also turn previously infeasible analyses into
feasible ones.
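The gap between k^n tuple components and n + 1 leaf nodes can be reproduced by brute force on the concrete semantics. The following sketch enumerates all configurations of the method described above and counts how many distinct final values of i occur; it is an illustration only, and the function names are hypothetical.

```python
from itertools import product
from typing import Tuple


def run_test(config: Tuple[int, ...]) -> int:
    """Concrete run: one #if (A_j = 0) i := i+1 #else i := 0 #endif per feature."""
    i = 0
    for a in config:
        i = i + 1 if a == 0 else 0
    return i


def summarize(n: int, k: int) -> Tuple[int, int]:
    """Return (number of configurations, number of distinct final values of i)."""
    configs = list(product(range(k), repeat=n))
    values = {run_test(c) for c in configs}
    return len(configs), len(values)


small = summarize(2, 3)   # the example with two features of domain [0, 2]
larger = summarize(5, 3)
```

The final value of i equals the number of trailing features that are assigned 0, so it always lies between 0 and n regardless of k. This is why the number of leaf nodes stays at n + 1 while the tuple size grows as k^n.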
7 Related Work


Decision-tree abstract domains have been successfully used in the field of abstract
interpretation recently [18,9,4,26]. Decision trees have been applied for the
disjunctive refinement of the Interval domain [18]. That is, each element of the new
domain is a propositional formula over interval linear constraints. Segmented
decision tree abstract domains have also been defined [9,4] to enable path-dependent
static analysis. Their elements contain decision nodes that are determined either by
values of program variables [9] or by the branch (if) conditions [4], whereas the
leaf nodes are numerical properties. Urban and Miné [26] use decision tree-based
abstract domains to prove program termination. Decision nodes are labelled
with linear constraints that split the memory space and leaf nodes contain affine
ranking functions for proving program termination.

Recently, two main styles of static analysis have been a topic of considerable 
research in the SPL community: a dataflow analysis from the monotone framework 
developed by Kildall [21] that is algorithmically defined on syntactic CFGs, and an 
abstract interpretation-based static analysis developed by Cousot and Cousot [7] 
that is more general and semantically defined. Brabrand et al. [3] lift a dataflow
analysis from the monotone framework, resulting in a tuple-based lifted dataflow
analysis. Another efficient implementation of the lifted dataflow analysis from the
monotone framework is based on using variational data structures [27]. Midtgaard
et al. [22] have proposed a formal methodology for systematic derivation of
tuple-based lifted static analyses in the abstract interpretation framework. A more
efficient lifted static analysis by abstract interpretation obtained by improving 
representation via BDD domains is given in [11]. Another approach to speed up 
lifted analyses is by using so-called variability abstractions [14,15], which are 
used to derive abstract lifted analyses. They tame the combinatorial explosion 
of the number of configurations and reduce it to something more tractable by 
manipulating the configuration space. The work [5] presents a model checking 
technique to analyze probabilistic program families. 


8 Conclusion 


In this work we employ decision trees and widely-known numerical abstract
domains for automatic inference of invariants in all locations of C program
families that contain numerical features. In the future, we would like to extend the
lifted abstract domain to also support non-linear constraints [17]. An interesting
direction for future work would be to explore possibilities of applying variability
abstractions [14] as yet another way to speed up lifted analyses. We can also
define a backward lifted analysis in combination with a preliminary forward lifted
analysis to infer the necessary preconditions for a given assertion to be
satisfied or violated. The obtained preconditions in the form of linear constraints
can be analyzed using model counting techniques to quantify how likely an
input or a variant is to satisfy them [16,12].
Abstract. When using multiple models to describe a (software) system,
one can use a network of model transformations to keep the models 
consistent after changes. No strategy exists, however, to orchestrate the 
execution of transformations if the network has an arbitrary topology. 
In this paper, we analyse how often and in which order transformations 
need to be executed. We argue why linear execution bounds are too 
restrictive to be useful in practice and prove that there is no upper bound 
for the number of necessary executions. To avoid non-termination, we 
propose a conservative strategy that makes execution failures easier to 
understand. These insights help developers and users of transformation 
networks to understand under which circumstances their networks can 
terminate. Additionally, the proposed strategy helps them to find the 
cause when a network cannot restore consistency. 


Keywords: model consistency · model transformation networks


1 Introduction 


When modelling systems, one is often confronted with the task of model consis- 
tency: Since model-driven development aims at separating concerns by tailoring 
models to the needs of the people working on the system, there are typically 
different models, each one capturing the parts of the system that are relevant to 
the model’s target audience. All those models taken together should describe a 
coherent system and not contain contradictory information. We say that the mod- 
els should be consistent. Automatic detection and resolution of inconsistencies is, 
however, still poorly addressed in current development processes [12]. 

There are different means of maintaining consistency. A popular one is to define 
incremental model transformations, which update models based on information 
that was changed in one of them. While there has been significant research 
on model transformations themselves, particularly on binary transformations, 
maintaining consistency of multiple models is less researched [2]. There are 
approaches for multiary model transformations which can transform between 
multiple models by means of a single transformation. Nevertheless, one will likely
also want to be able to combine multiple transformations (binary or multiary) to
maintain consistency, creating a transformation network. Unlike using a single, 
overarching transformation, defining a network makes it possible to reuse modular 
ones. Additionally, knowledge about consistency between certain types of models 
is often distributed across domain experts [13]. This can be accommodated by 
transformation networks, because every domain expert can define transformations 
independently and according to their view on consistency. 

To the best of the authors’ knowledge, no strategy that determines an execu- 
tion order of transformations to maintain consistency in a network with arbitrary 
topology has been presented yet. Existing work proposes, for example, defining 
an execution order explicitly [23, 35] or deriving a topological order [30]. Most
approaches restrict the supported kinds of network topologies to those in which
each transformation only needs to be executed once.

In this paper, we research properties and limitations of a universal strategy 
that executes a transformation network of arbitrary topology. We show that 
strategies that apply each transformation only once are not useful in practice. 
At the other end of the spectrum, we prove that not limiting the number of 
transformation executions does, in general, lead to non-termination. Based on 
the insight that a universal strategy can only operate conservatively, we derive a 
practicable strategy. In detail, we make the following contributions: 


Formalisation (C1): We formalise transformation networks and execution 
strategies to precisely define their expected properties. 

Conservativeness Proof (C2): We prove that a universal execution strategy 
must operate conservatively to avoid non-termination. 

Strategy Design (C3): We propose a strategy that improves explainability 
whenever no consistent models are found. 


The contributions establish fundamental knowledge about the design space of 
network execution strategies, their undecidability, and difficulties in reducing 
conservativeness. The proposed strategy helps transformation network developers 
and users to find the reasons when an execution does not yield consistent models. 


2 Problem Statement 


In this section, we will further motivate our research by giving an example and 
clarifying its context. We provide a formalisation for transformation networks 
and execution strategies to generate a common understanding and formal basis 
for transformation network orchestration, constituting contribution C1. 


2.1 Motivating Example 


Figure 1 depicts a software project whose contributors take the roles of architects, 
developers and user experience (UX) designers. One person can take multiple 
roles, but every role has a particular view on the project and uses related tools. 
Architects use a UML-based tool to analyse and plan the architecture. Developers
program the software in Java. These two models overlap: Although they cannot
be derived completely from each other, the implementation should follow the
architecture and architects want to see how code changes affect the architecture.

Fig. 1. Example for a transformation network in model-driven (software) development.

UX designers develop the UI for the software. Their designs overlap with 
the UML model, because, first, the software’s requirements mandate certain 
properties of the UI, and, second, the architecture may restrict which information 
can be shown at which point in the interface. The UI design also overlaps with 
the code, since static parts of the UI can be derived from the UI model. Ideally, 
changes in the UI code can even be propagated back into the UI model. 

The developers use OpenAPI™ [32] to exchange specifications of HTTP APIs. 
These specifications overlap with the parsing and serialisation code. Architects 
want to analyse how their architecture choices influence performance, using the 
Palladio Component Model (PCM) [24]. The architecture specification used in 
the PCM overlaps with the one defined in UML. Additionally, the PCM model 
contains information about performance properties and the deployment structure, 
which can partially be derived from the code. 

Those relations can be encoded in transformations to avoid re-specification 
of similar information, such as the architecture in PCM and UML, to derive 
information, like appropriate Java stubs from OpenAPI specifications, and to 
preserve information consistency. Figure 1 shows the resulting transformation 
network. In this paper, we will find an execution strategy for such transformations, 
which is needed to correctly propagate changes from one model to the others. 


2.2 Context 


We discuss model transformation networks in a specific usage context. We assume 
that different roles are involved in a development project, each using some 
models to describe their view of the system. The models are kept consistent 
by model transformations. For the sake of simplicity, we only discuss binary 
transformations between two models. To foster independent specification and reuse 
of transformations, we assume that they are not tailor-made, but may be general- 
purpose. As a consequence, we cannot assume that the models or transformations 
are or can be aligned, for example, to ensure that their execution in a specific
order always results in consistent models. Neither can we assume that the network
has a certain topology. We do, however, assume that all transformations are in
accordance with a well-defined overall notion of consistency (reaching a consistent
state would be impossible otherwise). This means that all requirements we pose
on the transformations must only concern a transformation itself. A requirement
like “no transformation overwrites the result of another” would not fit our context.

We require that transformations are synchronising [4], i.e., that they can deal
with the situation that both of their models have been changed. This is essential 
to find an execution strategy: When propagating changes in a transformation 
network that contains cycles, it will inevitably happen that both models that are 
connected by a transformation will be changed. In addition, the well-researched 
bidirectional transformations only change one of the models [28] and could in 
such a situation be forced to overwrite changes to yield a consistent result. This 
assumption also enables concurrent modifications by different project members. 


2.3 Formalisation 


We are not concerned with how models are structured, so we simply resort to
defining a universe 𝓜 that contains all models. First, we define the kind of
transformations that we use:

Definition 1. A synchronising binary transformation (syncx) t is a function
that updates two models:

  t : (𝓜 × 𝓜) → (𝓜 × 𝓜)

A syncx' image consists of fixed points:

  ∀a ∈ 𝓜 ∀b ∈ 𝓜 : t(t(a, b)) = t(a, b)

The universe of all syncx for 𝓜 is called 𝒯.


This formalisation is a simplification sufficient for the purposes of this paper. 
In practice, transformations will, for example, be allowed to indicate an error 
instead of being required to always produce appropriate new models. 

In comparison to existing formalisms [28], there is no consistency relation in
the definition of a syncx. For our purposes, the consistency relation is not part
of a syncx, but rather encoded implicitly in the syncx' behaviour. We assume
that the transformations are correct and hippocratic [28] with regard to their
implicit consistency relation and can then recover the relation:

Definition 2. The consistency relation R_t of syncx t is given by:

  R_t = { (a, b) | t(a, b) = (a, b) }
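Definitions 1 and 2 can be tried out on a small finite universe. The sketch below assumes that models are sets of names and checks the fixed-point condition and the recovered consistency relation by enumeration; share_names and swap are hypothetical example functions, not transformations from the paper.

```python
from itertools import product
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Model = FrozenSet[str]
Syncx = Callable[[Model, Model], Tuple[Model, Model]]


def image_consists_of_fixed_points(t: Syncx, universe: Iterable[Model]) -> bool:
    """Check t(t(a, b)) = t(a, b) for all pairs of models in a finite universe."""
    models = list(universe)
    return all(t(*t(a, b)) == t(a, b) for a, b in product(models, repeat=2))


def consistency_relation(t: Syncx, universe: Iterable[Model]) -> Set[Tuple[Model, Model]]:
    """Recover R_t = {(a, b) | t(a, b) = (a, b)} on a finite universe."""
    models = list(universe)
    return {(a, b) for a, b in product(models, repeat=2) if t(a, b) == (a, b)}


def share_names(a: Model, b: Model) -> Tuple[Model, Model]:
    """Example syncx: both models end up containing the names of either one."""
    merged = a | b
    return merged, merged


def swap(a: Model, b: Model) -> Tuple[Model, Model]:
    """Counter-example: swapping twice undoes the change, so the image is not fixed."""
    return b, a


universe = [frozenset(), frozenset({"A"}), frozenset({"A", "B"})]
relation = consistency_relation(share_names, universe)
```

For share_names, the recovered relation is equality of the two models, which shows how the consistency relation is encoded implicitly in the behaviour of the function.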
This paper focuses on transformation networks that are created when com- 
bining multiple syncx: 
Definition 3. A transformation network N =: ((V, E), T) consists of a directed,
connected, self-loop-free graph G = (V, E) and a syncx assignment T : E → 𝒯.
Any two vertices {a, b} ⊆ V have at most one edge between them:
(a, b) ∈ E ⇒ (b, a) ∉ E. The universe of all model transformation networks for 𝓜
is called U.

A transformation network captures the topology and the used transformations.
There is no inherent reason to exclude multigraphs or self-loops. We use this 
simpler definition because it makes it easier to argue about the networks without 
restricting expressiveness. We use directed edges instead of undirected ones to 
provide a notion of the “left” and “right” model for a syncx. The edges’ direction 
does not indicate anything about the direction of change propagation. We will 
usually regard the network as given and try to find suitable model assignments: 


Definition 4. For a transformation network N = ((V, E), T), a model
assignment M is a function M : V → 𝓜.


Naturally, we are particularly interested in model assignments that are con- 
sistent with the transformations: 


Definition 5. For a transformation network N = ((V, E), T), a model
assignment M is consistent if, and only if

  ∀(a, b) ∈ E : (M(a), M(b)) ∈ R_{T(a,b)}

The set of all consistent model assignments for N is called R_N.
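Definitions 3 to 5 translate into a small data structure, sketched below under stated assumptions: a network is a dictionary from directed edges to functions, models are sets of names, and the vertex names are only borrowed from the motivating example. The connectedness requirement of Definition 3 is not checked here, and all function names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Model = FrozenSet[str]
Syncx = Callable[[Model, Model], Tuple[Model, Model]]
Edges = Dict[Tuple[str, str], Syncx]     # the assignment T, keyed by edge (a, b)
Assignment = Dict[str, Model]            # a model assignment M


def is_valid_topology(edges: Edges) -> bool:
    """No self-loops and at most one edge between any two vertices."""
    return all(a != b and (b, a) not in edges for (a, b) in edges)


def is_consistent(edges: Edges, models: Assignment) -> bool:
    """M is consistent iff every edge's syncx leaves its two models unchanged."""
    return all(
        t(models[a], models[b]) == (models[a], models[b])
        for (a, b), t in edges.items()
    )


def share_names(a: Model, b: Model) -> Tuple[Model, Model]:
    """Hypothetical syncx: both models must contain the same names."""
    merged = a | b
    return merged, merged


network: Edges = {("uml", "java"): share_names, ("java", "pcm"): share_names}
aligned = {v: frozenset({"Order"}) for v in ("uml", "java", "pcm")}
diverged = {**aligned, "pcm": frozenset({"Order", "Cart"})}
```

Using the fixed points of each syncx as its consistency relation means that no separate relation has to be stored in the network.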


We use the following additional notation in this paper: 


— “A → B” for the set of functions from set A to set B

— “f : A ⇀ B” for a partial function f from A to B

— “f(x) = ⊥” to mean that a partial function f is not defined at x

— “Im(f)” to denote the image of a function f


2.4 Problem Description 


Our goal is to find an algorithm that, given a transformation network N =:
((V, E), T) ∈ U and a model assignment M, finds a consistent model assignment
M' by applying transformations in Im(T). We call such an algorithm a
“(transformation network) execution strategy”. It is “universal” if it is
parametrised by and thus defined for every network.

Definition 6. A universal execution strategy determines an order (i.e., a
permutation with duplicates) of transformations in Im(T) for a given transformation
network N =: ((V, E), T) ∈ U and model assignment M ∈ (V → 𝓜). It realises a
partial function S : U × (V → 𝓜) ⇀ (V → 𝓜).


An execution strategy finds a new model assignment only by executing the
transformations of the network, as more precisely defined by Klare et al. [15,
Definition 8]. If S(N, M) ≠ ⊥, we say that the strategy “resolves” N and M. If
S(N, M) = ⊥, we say that the strategy fails. We have further requirements:

Requirement 1. An execution strategy must be correct:

  ∀N = ((V, E), T) ∈ U ∀M ∈ (V → 𝓜) : S(N, M) ∈ R_N ∪ {⊥}

Requirement 2. An execution strategy must be hippocratic:

  ∀N = ((V, E), T) ∈ U ∀M_c ∈ R_N : S(N, M_c) = M_c

An execution strategy will not always be able to find a consistent new model
assignment (i.e., there will be some N, M such that S(N, M) = ⊥). First, there
may not be a consistent model assignment at all (i.e., R_N = ∅). Second, there may
be a consistent model assignment but no execution order of the transformations 
that yields that assignment [30, 16]. We call such inputs “unresolvable” [30]. 
Conversely, if there is an execution order of the transformations that yields a 
consistent model assignment, we call the inputs “resolvable”. 

An execution strategy may even fail for resolvable inputs: The execution 
strategy may not “find” a consistent model assignment, even though it is reachable. 
For example, the strategy may abort before having executed the transformations 
often enough, or finding the assignment might require an order of execution 
which the strategy does not consider. We call such a strategy “conservative”: 


Definition 7. An execution strategy S is conservative if it is correct and if there 
can be resolvable inputs N, M with S(N,M) = L. 
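The notions of correct, hippocratic and conservative can be illustrated with a deliberately naive strategy. The sketch below is not the strategy proposed in this paper: it is a hypothetical round-robin executor with a fixed bound on the number of rounds, where None plays the role of ⊥ and models are assumed to be sets of names.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Model = FrozenSet[str]
Syncx = Callable[[Model, Model], Tuple[Model, Model]]
Edges = Dict[Tuple[str, str], Syncx]
Assignment = Dict[str, Model]


def is_consistent(edges: Edges, models: Assignment) -> bool:
    return all(
        t(models[a], models[b]) == (models[a], models[b])
        for (a, b), t in edges.items()
    )


def bounded_round_robin(edges: Edges, models: Assignment,
                        max_rounds: int) -> Optional[Assignment]:
    """Execute every syncx once per round, for at most max_rounds rounds.

    Correct: only a consistent assignment or None is returned.
    Hippocratic: a consistent input is returned without executing anything.
    Conservative: a resolvable input yields None if the bound is too small.
    """
    current = dict(models)
    for _ in range(max_rounds):
        if is_consistent(edges, current):
            return current
        for (a, b), t in edges.items():
            current[a], current[b] = t(current[a], current[b])
    return current if is_consistent(edges, current) else None


def share_names(a: Model, b: Model) -> Tuple[Model, Model]:
    merged = a | b
    return merged, merged


network: Edges = {("uml", "java"): share_names, ("java", "pcm"): share_names}
changed: Assignment = {
    "uml": frozenset({"Order"}),
    "java": frozenset(),
    "pcm": frozenset({"Cart"}),
}
one_round = bounded_round_robin(network, changed, max_rounds=1)
two_rounds = bounded_round_robin(network, changed, max_rounds=2)
```

With a single round, the name added in the last model never travels back to the first one, so the strategy fails although the input is resolvable. Raising the bound trades conservativeness against the number of executions.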


The higher the probability that an execution strategy yields a result for 
resolvable inputs (we also say the lower its “level of conservativeness”), the more 
useful the strategy will be. It is, however, also desirable that the strategy is 
predictable, meaning that one can determine beforehand for which inputs the 
strategy will succeed. For example, it would be useful to know whether a strategy 
yields a result for a given network for any resolvable model assignment. Informally 
speaking, we would like to have an “easy-to-check” criterion for transformation 
networks determining whether this is the case. An even better criterion could be 
applied to a single syncx, such that the strategy can resolve all inputs with a 
network of syncx that fulfil the criterion. This would be ideal for the motivated 
context of independently developing and freely combining syncx to a network. 

To summarise, we aim to find a correct, hippocratic execution strategy that is 
able to keep models consistent via transformation networks. The strategy should 
succeed for realistic inputs with a high probability. Additionally, we aim to find 
criteria that determine the cases in which the strategy will succeed. 


3 Related Work 


Approaches for restoring model consistency have been subject to intensive research, 
surveyed by Macedo et al. [21]. Model transformations are a well-researched option, 
and several tools and languages have been developed to support them [27, 18, 25]. 
Research has, however, mainly focused on consistency between two models, including
theoretical properties such as termination, which is one of the properties
that we investigate for the execution of transformation networks [7]. Maintaining
consistency between more than two models has recently gained more attention,
notably through a dedicated Dagstuhl seminar [2]. Two central approaches can be
distinguished: multiary transformations and networks of binary transformations.
In Section 1, we have discussed that multiary transformations are
complex to specify, whereas networks of binary transformations have limited 
expressiveness [30], which does, however, not seem to be practically relevant [2].


Multiary Transformations: Different approaches for multiary transformations
have been proposed. QVT-R [22] supports multidirectionality already by design, 
but ambiguities in the standard limit practical applicability [20]. Triple Graph 
Grammars (TGGs) [26] are bidirectional specifications, which are well-suited for 
model transformations [1]. Extensions of TGGs to multiple models called Multi 
Graph Grammars (MGGs) [17] and Graph Diagram Grammars [34, 33] consider 
the specification of multidirectional rules. All these approaches, however, require 
the transformation developer to know about and be able to express the relations 
between all involved models, which we reasonably excluded by assumption. 


Auxiliary Models: Not all multiary relations can be expressed by sets of binary
ones. Adding one auxiliary model makes it, however, theoretically possible to 
express arbitrary multiary relations by binary ones [30]. Some work discussed 
which kinds of relations can be expressed with such an approach and how they 
can be formalised in the lenses framework [5, 31]. Other work discussed how 
composing such auxiliary models to express commonalities of models can be 
achieved [14]. Such auxiliary models actually encode a multiary transformation 
in a model together with binary transformations to the models to keep consistent, 
resulting in the same challenges as for transformation networks. In consequence,
our work on transformation networks is also required and applicable there. 


Binary Transformations: Although they cannot express all multiary relations, 
there are arguments in favour of using networks of modular transformations, 
especially binary ones: They are easier to develop when domain knowledge is 
distributed [13] and they are easier to comprehend by a single developer [2, 30]. 
Additionally, binary transformations are well researched, and a variety of tools
supporting different ways of specifying them exists [27, 18, 25, 21]. Most formalisms
and tools consider bidirectional transformations, whereas networks require
synchronising transformations, as motivated in Section 2.2. Non-synchronising
transformations can, however, be adapted to become synchronising [37].


Transformation Chains: Transformation chains combine transformations to 
derive low-level models from high-level ones across intermediate representations. 
Languages like FTG+PM [19] and UniTI [35] enable the specification of such 
chains. Transformation chains are, however, only a special case of general
transformation networks. Etien et al. consider specific properties of transformation chains.
They investigate how conflicts in terms of results depending on the execution 
order can be detected [8]. These results do, however, not aim to relieve developers 
from the task of finding an execution order manually, as we do in this paper. 


Transformation Composition: Transformation composition techniques are 
a means to build networks of binary transformations. They can be separated 
into internal, white-box approaches [36], and external techniques, which consider 
transformations as black-boxes. Our contributions can be seen as an external 
composition technique. However, composition usually considers transformations 
between the same rather than different types of models. From a theoretical 
perspective (see Section 2.3) this could be treated equally by not distinguishing 
models by their metamodels. Practical approaches, however, consider
transformations between specific metamodels rather than arbitrary models.

Fig. 2. Example yielding inconsistent models after executing each transformation once.
Numbers in italics indicate the order in which changes are performed.


Execution Strategies: Di Rocco et al. [3] describe a simple strategy for
orchestrating transformations, but make strong assumptions requiring that each of
them is only applied once. Stevens [30] proposes a strategy that also executes each 
transformation only once in one direction. It includes a notion of authoritative 
models, which are not allowed to be changed, and does not consider synchronising 
transformations. Likewise, Stevens [29] proposes to find an orientation model 
defining in which direction transformations are executed. If, however, several 
transformations modify the same model, the approach leaves it to the developer 
to determine an execution order after which all consistency relations hold. Such 
strategies are only correct if the network is a tree, or if no transformations interfere 
with each other. We present a simple scenario in which this is already too limiting 
in Section 4.1. We overcome this limitation by executing transformations more 
than once and thereby letting them “negotiate” a result even if they interfere, 
which yields a universal execution strategy for arbitrary network topologies. 


4 Design Space 


We approach the possibilities for designing an execution strategy by looking at 
how often it executes syncx in the worst case. We consider the two extremes of 
executing every syncx at most once and executing them an unlimited number of 
times, and find that neither of them will do: While the first one is too limiting, the 
second one cannot guarantee termination. As a consequential insight, a universal 
execution strategy needs to be conservative, introduced as contribution C2. 


4.1 One Execution per Transformation 


Several proposed strategies execute every transformation in a network at most 
once [30, 35]. Since we expect that transformations are developed independently, 
and are thus not necessarily aligned (see Section 2.2), restricting the number 
of executions to one per transformation would, however, limit the possible 
combinations of them, and models could not be kept consistent in desirable 
scenarios. We give an example for this in the following.

Fig. 3. A transformation network with n transformations reacting to each other.


We use the example of Section 2.1, and focus on the UML, Java and OpenAPI 
models to consider the scenario visualised in Figure 2: An architect creates a new 
UML interface and applies an execution strategy that executes every transformation
once. First, the UML-to-Java syncx creates an appropriate interface in Java.
The Java-to-OpenAPI syncx recognises that the interface should be exposed
via an HTTP API and creates a matching endpoint in the OpenAPI model. 
Additionally, it creates a stub implementation with parsing and serialisation code 
in Java. The stub implementation classes can, however, not be propagated back 
to UML, because the UML-to-Java syncx has already been executed. 

We see that if we limit the number of executions to one per transformation, 
transformations cannot propagate back the changes that other transformations 
have made. However, in the context described in Section 2.2, it is necessary that 
transformations are able to “react” to the changes made by other transformations. 
This offers, for instance, separation of concerns: The logic for a certain aspect of 
consistency can be put in only one transformation and other transformations will 
propagate it throughout the network. Without such a mechanism, all aspects of 
consistency would need to be implemented in all transformations. This would 
cause duplication of logic and reduce reusability of transformations, which would 
be impractical and contradicts our assumption of independent development. If 
we added the logic for creating implementations of relevant Java interfaces to 
the UML-to-Java syncx, then it would implicitly assume the presence of the 
Java-to-OpenAPI syncx. It could, thus, not be easily reused in networks where 
the Java-to-OpenAPI syncx is not used. 

We can generalise the previous example: Let the model universe be the natural
numbers: $\mathcal{M} = \mathbb{N}_0$. Let further for any $1 \le j \le n$ the syncx $\iota_j$ be defined as

$$\iota_j(a, b) := \begin{cases} (m+1,\ m+1) & \text{if } m = j \\ (m,\ m) & \text{else} \end{cases} \qquad \text{with } m := \max\{a, b\}$$

$\iota_j$ sets both models to the higher number of the two, except if that number is $j$.
Then $\iota_j$ increments the result by one. This is an abstraction of syncx “reacting”
to each other: The $\iota_j$ seek to set all models to the same value, except that after
$\iota_{j-1}$ was executed, $\iota_j$ changes its behaviour and increments the value by one.

We now construct the transformation network $N_n$ for $n = 2k$, $k \in \mathbb{N}^+$ (see
Figure 3) with $n$ indicating the number of syncx within the network, and examine
how many executions it requires:

$$T_n(i, i+1) := \begin{cases} \iota_{2i} & \text{if } i \le \frac{n}{2} \\ \iota_{2i-n-1} & \text{else} \end{cases}$$

$$N_n := \big(([1, n+1],\ \{(i, i+1) \mid i \in [1, n]\}),\ T_n\big)$$


Lemma 1. $\iota_n$ must be executed at least $n$ times to resolve $N_n$ with the initial
model assignment


Proof. The only reachable model assignment that is consistent is $M_n\colon i \mapsto n$. It is
reached by having every $\iota_j$ increment the highest number in the model assignment
by one if that highest number currently is $j$. All transformations incrementing
even numbers are on one side of $\iota_n$ (except for $\iota_n$ itself), all transformations
incrementing uneven numbers are on the other side. Thus, the currently highest
number must be propagated to the other side of $\iota_n$ at least $n-1$ times. Additionally,
$\iota_n$ must increment $n - 1$ to $n$.


Theorem 1. For any execution strategy that uses O(1) executions of each trans- 
formation, there are inputs that the execution strategy cannot resolve. 


Proof. Follows directly from Lemma 1. 


The example network in Figure 2 is a simplification of a realistic transformation
scenario, which we generalised to the network $N_n$. In consequence of Theorem 1,
we can expect that transformation networks can, in general, not be resolved with 
O(1) executions of each transformation. 
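
The behaviour of the generalised network can be explored with a small simulation. The Python sketch below is a hedged illustration under two assumptions that are not taken from the text: it reads the definition as "increment when the maximum equals j", and it starts from an assignment in which model 1 holds the value 1 and all other models hold 0. Under this reading the network stabilises at the value n + 1; other readings of the definition shift this value by one. All function names are invented for the example, and the round-robin order is only one of many possible orders.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]  # (vertex i, vertex i+1, index j of iota_j)


def iota(j: int, a: int, b: int) -> Tuple[int, int]:
    """Assumed reading: set both models to the maximum, incremented if it equals j."""
    m = max(a, b)
    return (m + 1, m + 1) if m == j else (m, m)


def build_network(n: int) -> List[Edge]:
    """Path over vertices 1..n+1; edge (i, i+1) carries iota_{2i} or iota_{2i-n-1}."""
    assert n >= 2 and n % 2 == 0
    return [(i, i + 1, 2 * i if i <= n // 2 else 2 * i - n - 1) for i in range(1, n + 1)]


def initial_assignment(n: int) -> Dict[int, int]:
    """Assumption of this sketch: model 1 holds 1, all other models hold 0."""
    return {v: 1 if v == 1 else 0 for v in range(1, n + 2)}


def execute(order: List[Edge], models: Dict[int, int]) -> List[int]:
    """Run the edges in the given order; return the indices j that changed a model."""
    effective = []
    for a, b, j in order:
        new = iota(j, models[a], models[b])
        if new != (models[a], models[b]):
            models[a], models[b] = new
            effective.append(j)
    return effective


def is_consistent(network: List[Edge], models: Dict[int, int]) -> bool:
    """True if no syncx of the network would change the assignment."""
    return all(iota(j, models[a], models[b]) == (models[a], models[b]) for a, b, j in network)


def once_each_can_resolve(n: int) -> bool:
    """Try every order that executes each syncx exactly once."""
    network = build_network(n)
    for order in permutations(network):
        models = initial_assignment(n)
        execute(list(order), models)
        if is_consistent(network, models):
            return True
    return False


def run_until_stable(n: int) -> Tuple[Dict[int, int], List[int]]:
    """Repeat round-robin passes until a pass changes nothing."""
    network, models, log = build_network(n), initial_assignment(n), []
    while True:
        effective = execute(network, models)
        if not effective:
            return models, log
        log.extend(effective)


if __name__ == "__main__":
    n = 4
    models, log = run_until_stable(n)
    print("one execution each suffices:", once_each_can_resolve(n))
    print("stable assignment:", models)
    print("effective executions of iota_n:", log.count(n))
```

The exhaustive search over single-pass orders is only feasible for small n, but it makes the point of Theorem 1 tangible: no order that executes every syncx once reaches a consistent assignment, whereas repeated execution does.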


4.2 Unlimited Executions 


We now consider an execution strategy that executes transformations as long as 
they still change models, and terminates once no more changes occur. This
overcomes the shortcoming that we observed with limiting the number of executions
to a constant; we will, however, see that we cannot guarantee termination of 
such an execution strategy. By simulating Turing machines with transformation 
networks, we prove that it is undecidable whether the strategy will terminate. 
Given a Turing machine TM over some alphabet $\Sigma$, we construct a
transformation network $N_{TM} =: ((V, E), T_{TM})$ and a model assignment $M_{TM,x}$ that
are resolvable if, and only if, TM halts on input $x \in \Sigma^*$. We assume that TM
contains no self-loops as well as no cycles of length 2, i.e., that each transition
and each sequence of two transitions changes the state of TM. This is without
loss of generality, since duplication and triplication of each state resolves such
self-loops and cycles, respectively. The constructed models consist of a
timestamp, the tape content and the tape position (i.e., $\mathcal{M} = \mathbb{N}_0 \times \Sigma^* \times \mathbb{N}_0$). The
network $N_{TM}$ has TM’s states as vertices and exactly one directed edge (in
arbitrary direction) between each pair of states having a transition between them.
The transformations increment the timestamp, change the tape content and
update the tape position according to TM’s transition if, and only if, the source
model’s timestamp is higher than the target model’s timestamp. More formally,
let $\mathrm{Tr}(a, b) \subseteq \Sigma \times \{-1,0,1\} \times \Sigma$ be the transitions defined between the states a
and b (with $-1$, $0$ and $1$ indicating the head movements “left”, “stay” and “right”).
We define $T_{TM}$ with $w[p \leftarrow r] := w[0..p-1] \cdot r \cdot w[p+1..|w|-1]$ such that:

$$\forall (a,b) \in E\colon\ T_{TM}(a,b)\big(\alpha =: (t_a, w_a, p_a),\ \beta =: (t_b, w_b, p_b)\big) := \begin{cases} \big(\alpha,\ (t_a + 1,\ w_a[p_a \leftarrow r],\ p_a + d)\big) & \text{if } t_a > t_b \land \exists\, (w_a[p_a], d, r) \in \mathrm{Tr}(a,b) \\ \big((t_b + 1,\ w_b[p_b \leftarrow r],\ p_b + d),\ \beta\big) & \text{if } t_a < t_b \land \exists\, (w_b[p_b], d, r) \in \mathrm{Tr}(b,a) \\ (\alpha,\ \beta) & \text{else} \end{cases}$$


Let s be the initial state of TM. We set

$$M_{TM,x}\colon\ v \mapsto \begin{cases} (1, x, 0) & \text{if } v = s \\ (0, \varepsilon, 0) & \text{else} \end{cases}$$


Lemma 2. Executing the transformations of $N_{TM}$ with initial model assignment
$M_{TM,x}$ until no transformations change the model assignment anymore terminates
if, and only if, TM halts on input x. If executing the transformations terminates
with the final model assignment $M_f$, then the model with the highest timestamp
in $\mathrm{Im}(M_f)$ contains TM(x) as tape content.


Proof. We can see by induction over the model assignments $M_i$, $i \in \mathbb{N}_0$, created
while executing the transformations:


1. There is exactly one $v \in V$ such that the model $M_i(v) =: (t, x, p)$ has the
highest timestamp t of all models in $\mathrm{Im}(M_i)$.

2. There is at most one edge $(a,b) \in E$ whose transformation is inconsistent, i.e.,
$(M_i(a), M_i(b)) \notin R_{T_{TM}(a,b)}$. This follows from the definitions of TM and the
last executed transformation. Additionally, a = v or b = v, because otherwise
there would have been two transformations to which models in $\mathrm{Im}(M_{i-1})$ are
inconsistent. We assume without loss of generality a = v.

3. If (a,b) exists, then $m' := M_{i+1}(b)$ will contain the same tape content and the
same tape position as would result if TM was executed one step from state v
with tape content x and tape position p. Additionally, $m'$ will be the model
with the highest timestamp of all models in $\mathrm{Im}(M_{i+1})$.

4. (a,b) does not exist if, and only if, TM would halt in state v with tape content 
x and tape position p. 


Theorem 2. Let S be an execution strategy that executes transformations until
a consistent model assignment is reached. There are inputs for which it cannot
be decided whether S will terminate.


Proof. It follows from Lemma 2 that deciding whether S terminates could decide 
the halting problem for a universal Turing machine. 
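
The strategy considered in this section is a plain fixed-point iteration. The Python sketch below illustrates it under simplifying assumptions: models are integers, a syncx is a function on a pair of models, and a hypothetical `max_rounds` budget is added so that the example itself always returns. The names `run_until_stable`, `keep_equal` and `successor` are invented for this illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sync = Callable[[int, int], Tuple[int, int]]
Edge = Tuple[str, str, Sync]


def run_until_stable(network: List[Edge], models: Dict[str, int],
                     max_rounds: int) -> Optional[Dict[str, int]]:
    """Execute all syncx round-robin until none changes a model.

    Returns the stable assignment, or None once max_rounds passes were used.
    Without the budget, the loop may run forever (cf. Theorem 2).
    """
    models = dict(models)
    for _ in range(max_rounds):
        changed = False
        for a, b, sync in network:
            new = sync(models[a], models[b])
            if new != (models[a], models[b]):
                models[a], models[b] = new
                changed = True
        if not changed:
            return models
    return None


def keep_equal(x: int, y: int) -> Tuple[int, int]:
    """Hypothetical syncx: both models take the larger value."""
    return max(x, y), max(x, y)


def successor(x: int, y: int) -> Tuple[int, int]:
    """Hypothetical syncx: the second model must be the first plus one."""
    return x, x + 1


converging = [("a", "b", keep_equal), ("b", "c", keep_equal)]
# A cycle of successor constraints has no consistent assignment at all.
diverging = [("a", "b", successor), ("b", "c", successor), ("c", "a", successor)]

print(run_until_stable(converging, {"a": 5, "b": 0, "c": 0}, max_rounds=100))
print(run_until_stable(diverging, {"a": 0, "b": 0, "c": 0}, max_rounds=100))
```

Any fixed budget of this kind turns the unbounded strategy into a conservative one: the loop now always returns, but it may give up on inputs that a longer run would have resolved.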


Even worse, this construction makes it unlikely that we will find a practicable 
criterion that ensures success of an execution strategy like we have motivated in 
Section 2.4. Because we want the criterion to apply to a single syncx, it would 
need to restrict the syncx so much that it makes building a network simulating
Turing machines out of the syncx impossible. But since the definition of the
syncx in $\mathrm{Im}(T_{TM})$ is structurally simple, it seems unlikely that a syncx fulfilling
the hypothetical criterion would still be apt for most practical use cases. 

We could avoid undecidability if we restricted the models’ size. The models 
could then no longer store an unbounded tape and, thus, only simulate space- 
restricted Turing machines. There is, however, no reasonable bound for a necessary 
model size, to which they could be limited. In consequence, determining a universal 
space bound for models would be an arbitrary and thus impractical restriction. 

Finally, one could question whether it is relevant if an execution strategy can 
be guaranteed to terminate. Execution strategies will be used to tell users whether 
changes they made can be incorporated into the other models automatically. 
In consequence, users should get a reliable and timely response. We might
compare this situation to merging changes in version control systems. There, 
users also want a reliable and timely response on whether their changes could be 
incorporated automatically, or whether they need to resolve conflicts manually. 


5 Proposed Strategy 


As a consequence of the previous findings, every universal execution strategy will 
be conservative: there will be inputs for which it fails, even though there would 
have been an execution order leading to a consistent model assignment. In this 
section, we discuss how to find an appropriate execution order and bound, and 
finally present the “explanatory strategy”, constituting contribution C3. 


5.1 Execution Order: Providing Explainability 


Increasing the number of transformation executions that an execution strategy permits
lowers its level of conservativeness. In contrast, the effects of different orders in
which transformations can be executed are not as easy to categorise. The authors
developed a model transformation network simulator [11], whose source code
is available on GitHub [10]. It allows constructing transformation networks and
defining execution strategies, which can be applied step by step. All examples
presented in this paper are also modelled in the simulator. For each examined 
systematic execution order, such as a depth-first or breadth-first selection, the 
authors found categories of networks on which the order performed worse than 
another one in terms of conservativeness. In consequence, conservativeness is not 
a good sole criterion to evaluate orders by. 

We know that a universal execution strategy will inevitably be conservative, 
i.e., possibly fail for resolvable inputs. In practice, it will be important how well 
an execution strategy provides explainability in such cases, i.e., helps users to 
understand where and why the strategy failed with the selected execution order. 
The order plays a decisive role in this regard, which is why we focus on finding a 
strategy that improves the order. Imagine, for instance, that the strategy executed 
transformations in an arbitrary order until some limit is reached. Users might 
then be confronted with a situation where all transformations have been executed,
but the last model assignment is only consistent with some of them. There would
be no clear pattern and few clues for users as to where to start investigating the
failure’s cause. To improve explainability, the authors thus propose the following
principle for an execution order: 


Principle 1. Ensure consistency among the transformations that have already 
been executed before executing a transformation that has not been executed yet. 


Since a syncx can change both models, executing it may result in models that
are inconsistent with the syncx that have been executed previously. Following 
Principle 1, these inconsistencies should be addressed first. In effect, a strategy 
applying the principle will maintain a subnetwork of syncx with a consistent model 
assignment and try to expand the subnetwork transformation by transformation. 

To exemplify how Principle 1 provides explainability, suppose that an execution
strategy applying that principle fails after having executed the set of syncx $E \subseteq T$.
Let $t \in E$ be the last syncx that was executed for its first time. The strategy can
then inform users that integrating t into the subnetwork induced by E failed.
Furthermore, it can inform users that a result that is consistent with the syncx
in $E \setminus \{t\}$ exists. By that, users gain valuable information for handling the error:
First, when trying to understand the error, they can ignore any syncx that is
not in E. Second, some aspect of consistency that is present in the consistency
relation realised by t, but absent in the consistency relations realised by the syncx
in $E \setminus \{t\}$, hinders the strategy from creating a consistent result. Third, when
users try to find a consistent model assignment manually, they can start with the
consistent result that exists for $E \setminus \{t\}$ instead of having to start from scratch.


5.2 Execution Bound: Reacting to Each Other 


As we have seen, we need to restrict the number of transformation executions
with a function in $\omega(m)$ (m being the number of syncx in the input network).
Such a limit must be reasonable to support most practical use cases: Not allowing 
enough transformation executions reduces the usefulness of the strategy since not 
all useful networks can be resolved. Allowing too many executions might make 
the strategy run for a long time before aborting, without adding much value. 

In Section 4.1, we have motivated that syncx should be able to “react” to 
each other. We have seen that this excludes any bound in O(1) for the number 
of executions per transformation, but to guarantee termination we can also not 
allow transformations to react to each other indefinitely. If a syncx t changes the
models and the other already executed syncx have reacted to those changes by
adapting the models to be consistent with them as well, t should not react by
changing the models again. Because if t changed the models again, this could
easily result in executing the same sequences of transformations repeatedly and 
there would likely be no consistent result. 

We call transformations that behave in the described way N-converging. This 
is not a property of a syncx on its own but relative to its network N. Thus, it 
cannot be achieved just by proper construction of an individual transformation. 




Algorithm 1. The explanatory strategy in pseudocode.

 1  Procedure propagate(network, changes):
 2      executed ← ∅
 3      accumulatedChanges ← changes
 4      Invariant: accumulatedChanges applied to network consistent to executed
 5      while network.contains(candidate | candidate ∉ executed
                ∧ accumulatedChanges.adjacentTo(candidate)) do
 6          candidateChanges ← candidate.execute(accumulatedChanges)
 7          subnetwork ← network.edgeInducedSubgraph(executed)
 8          propagationChanges ←
                propagate(subnetwork, accumulatedChanges ∪ candidateChanges)
 9          candidateChanges ← candidate.execute(propagationChanges)
10          if candidateChanges.adjacentToAny(executed) then
                // Only happens if candidate is not network-converging
11              fail(executed, propagationChanges)
12          accumulatedChanges ← propagationChanges ∪ candidateChanges
13          executed ← executed ∪ candidate
14      return accumulatedChanges


There is, unfortunately, also no simple way to check it statically. Nevertheless, it 
captures the sensible expectation for transformations explained above. We obtain
an execution bound for a strategy by only requiring it not to fail if all syncx
are N-converging. We will see how this execution bound behaves in combination 
with Principle 1 in the subsequently presented execution strategy. 


Definition 8. Let $N =: (G,T)$ be a transformation network. A syncx $t \in \mathrm{Im}(T)$
is N-converging if for every initial model assignment and each subset of the
syncx $T_s \subseteq \mathrm{Im}(T)$ with $t \in T_s$ the resulting model assignment is consistent to t
whenever t has been executed after a sequence of the syncx in $T_s$ that contains
each permutation of those syncx as a (not necessarily continuous) subsequence.


We only require that the sequence of transformation executions contains each
permutation, but allow other executions in between. As an example, assume a
network N of N-converging syncx $t_1$, $t_2$ and $t_3$. After executing them in the […]
$t_1$ because $t_1$ was not executed after the order $t_3\,t_2$. After executing $t_1$ once more,
the resulting model assignment must now be consistent with all syncx: $t_1$ was
executed after the two orders of other syncx $t_2\,t_3$ and $t_3\,t_2$. Likewise, $t_2$ was
executed after $t_1\,t_3$ and $t_3\,t_1$, and $t_3$ was executed after $t_1\,t_2$ and $t_2\,t_1$.
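
The condition on execution sequences can be checked mechanically. The following Python sketch is an illustration, not the authors' tooling: it follows the reading used in the example above, namely that the last execution of a syncx must come after every order of the other syncx, each occurring as a not necessarily contiguous subsequence. The execution sequence `t1 t2 t3 t1 t2 t3` and the function names are hypothetical choices made for this example.

```python
from itertools import permutations
from typing import List, Sequence


def is_subsequence(needle: Sequence[str], haystack: Sequence[str]) -> bool:
    """True if needle occurs in haystack in order, not necessarily contiguously."""
    it = iter(haystack)
    return all(x in it for x in needle)


def executed_after_all_orders(t: str, sequence: List[str]) -> bool:
    """Check whether the last execution of t follows every order of the other syncx."""
    if t not in sequence:
        return False
    last = len(sequence) - 1 - sequence[::-1].index(t)
    prefix = sequence[:last]
    others = sorted(set(sequence) - {t})
    return all(is_subsequence(p, prefix) for p in permutations(others))


sequence = ["t1", "t2", "t3", "t1", "t2", "t3"]  # hypothetical execution sequence
print({t: executed_after_all_orders(t, sequence) for t in ("t1", "t2", "t3")})

extended = sequence + ["t1"]  # execute t1 once more
print({t: executed_after_all_orders(t, extended) for t in ("t1", "t2", "t3")})
```

In the hypothetical sequence, only `t1` lacks one order of the other syncx before its last execution; one more execution of `t1` closes that gap for all three.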


5.3 The Explanatory Strategy 


Fig. 4. Exemplary execution of the explanatory strategy for a change in the topmost
model, depicting the iterations (horizontal) and recursion steps (vertical).

We now turn to a concrete strategy that realises the discussed design choices.
Algorithm 1 gives pseudocode for such a strategy, which we call the “explanatory
strategy”. At a high level, it acts like this: Given a changed model assignment, the
strategy picks the next candidate syncx to execute. After executing the candidate, 
the strategy calls itself on the subnetwork formed by the already executed syncx. 
By that, it propagates the changes of the last execution throughout the
subnetwork and ensures that they are consistent with the executed syncx. Finally,
the strategy executes the initial candidate again to ensure that the changes added 
during the subnetwork propagation are consistent with the candidate. If that 
repeated execution of the candidate generates new changes in any model that 
is kept consistent by an already executed syncx, the execution fails, because 
the candidate does not fulfil the definition of being N-converging, as we will 
see in the following. In that case, the procedure returns the already executed 
syncx to which consistency was restored by the also returned changes in order to 
support a user in examining the reasons for the strategy to fail. If the models 
are consistent with the candidate, the strategy picks the next one. In effect, 
the strategy realises Principle 1 in a recursive fashion and ensures that each 
permutation of all yet executed syncx is executed at every recursion level. 


Figure 4 depicts an exemplary execution of the strategy for a network with 
four models and four transformations. We assume that after an initially consistent 
state of the models, the topmost one was modified. We can see that each recursion 
only treats the subnetwork of previously executed transformations. Hence, the 
network gets smaller at each recursion level. 
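
The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: it is state-based, uses integers as models, reduces "changes" to the set of vertices whose model changed, and mutates the model assignment in place. The names `StrategyFailure`, `execute`, `keep_equal` and `always_increment` are invented for the example; `always_increment` is a deliberately ill-behaved syncx that never converges.

```python
from typing import Callable, Dict, List, Set, Tuple

Sync = Callable[[int, int], Tuple[int, int]]
Edge = Tuple[str, str, Sync]  # a syncx between the models at vertices a and b


class StrategyFailure(Exception):
    """Raised when a candidate changes models of already executed syncx again."""

    def __init__(self, executed: List[Edge], models: Dict[str, int]):
        super().__init__("candidate is not converging in this network")
        self.executed = executed  # syncx for which consistency was restored
        self.models = models      # assignment consistent with those syncx


def execute(edge: Edge, models: Dict[str, int]) -> Set[str]:
    """Run one syncx in place and return the vertices whose model changed."""
    a, b, sync = edge
    new_a, new_b = sync(models[a], models[b])
    changed = {v for v, old, new in ((a, models[a], new_a), (b, models[b], new_b))
               if old != new}
    models[a], models[b] = new_a, new_b
    return changed


def propagate(network: List[Edge], models: Dict[str, int], changed: Set[str]) -> Set[str]:
    """State-based sketch of Algorithm 1; returns all vertices changed so far."""
    executed: List[Edge] = []
    accumulated = set(changed)
    while True:
        # Line 5: pick a not yet executed syncx adjacent to a changed model.
        candidate = next((e for e in network if e not in executed
                          and (e[0] in accumulated or e[1] in accumulated)), None)
        if candidate is None:
            return accumulated
        candidate_changes = execute(candidate, models)              # Line 6
        propagation = propagate(executed, models,                   # Lines 7-8
                                accumulated | candidate_changes)
        snapshot = dict(models)  # state that is consistent with `executed`
        candidate_changes = execute(candidate, models)              # Line 9
        executed_vertices = {v for a, b, _ in executed for v in (a, b)}
        if candidate_changes & executed_vertices:                   # Line 10
            raise StrategyFailure(executed, snapshot)               # Line 11
        accumulated = propagation | candidate_changes               # Line 12
        executed.append(candidate)                                  # Line 13


def keep_equal(x: int, y: int) -> Tuple[int, int]:
    """Hypothetical syncx: both models take the larger value."""
    return max(x, y), max(x, y)


def always_increment(x: int, y: int) -> Tuple[int, int]:
    """Hypothetical ill-behaved syncx: it changes the models on every execution."""
    return max(x, y) + 1, max(x, y) + 1


models = {"uml": 5, "java": 0, "openapi": 0}  # the UML model was just changed
network = [("uml", "java", keep_equal), ("java", "openapi", keep_equal)]
propagate(network, models, {"uml"})
print(models)
```

Replacing the second syncx by `always_increment` makes the call raise `StrategyFailure`. The exception then carries the syncx that had been integrated successfully and an assignment that is consistent with them, which corresponds to the information the strategy returns to support users.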


Unlike the formalisation in Section 2.3, the presented algorithm is based on 
changes instead of model states. Changes contain information that cannot be 
recovered by comparing model states [6]. Thus in practice, we want to support 
change-based execution. The algorithm also uses changes to determine potential 
candidates for the next transformation to execute: It only picks candidates that 
are adjacent to a model that was changed. The input changes describe all changes 
that occurred since the last model assignment M that was known to be consistent. 
The procedure returns accumulatedChanges that, when applied to M, yield a 
new model assignment M’. For our formalisation, M’ is the algorithm’s output.

We discuss some implementation details for the explanatory strategy further
below. First, we prove that the strategy has indeed the motivated properties. We
assert that it always terminates and determine its execution bound.


Theorem 3. The explanatory strategy terminates for every input. 


Proof. Because all called functions terminate, only the loop (Line 5) and the 
recursive call in Line 8 can lead to non-termination. Let m denote the number 
of edges of network. The set executed is initialised to be empty (Line 2) and 
grows by one element in every iteration of the loop. The loop is executed no more 
than m times, because after m iterations there is no transformation that is not 
in executed and, thus, the loop condition cannot be fulfilled. 

The recursive call receives a network that is smaller than network in terms of 
edges, because it does not contain the current candidate. If network is empty, 
then the algorithm will not enter the loop and not make a recursive call. Hence, 
the recursive stack never gets higher than m. 


Theorem 4. The explanatory strategy executes syncx at most $O(2^m)$ times.


Proof. Let T(m) denote the number of syncx executions the algorithm invokes 
for a network with m edges. The set executed is initialised to be empty and 
grows by one syncx every loop iteration (Line 13). It follows that the recursive 
call in Line 8 receives a network that is one syncx larger each time. Thus, we find

$$T(0) = 0, \qquad T(m) = 2m + \sum_{i=0}^{m-1} T(i) = 2 + 2\,T(m-1) = 2\,(2^m - 1) \in O(2^m)$$
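
The recurrence can be read as follows: each of the m loop iterations executes its candidate twice (the term 2m), and the recursive call in iteration i + 1 works on a subnetwork of i syncx (the sum). Subtracting T(m-1) from T(m) leaves T(m) - T(m-1) = 2 + T(m-1), which gives the second form. The short Python check below, with a function name chosen for this illustration, evaluates the recurrence directly and compares it with the closed form.

```python
def executions(m: int) -> int:
    """T(m) from the recurrence T(m) = 2m + sum of T(i) for i < m, with T(0) = 0."""
    t = [0]
    for k in range(1, m + 1):
        t.append(2 * k + sum(t))
    return t[m]


for m in range(6):
    # Columns: m, value of the recurrence, closed form 2 * (2^m - 1).
    print(m, executions(m), 2 * (2 ** m - 1))
```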


Next, we show that the strategy fulfils the fundamental Requirements 1 and 2 
regarding correctness and hippocraticness, which we defined in Section 2.4. 


Theorem 5. The explanatory strategy is correct. 


Proof. Assume the contrary, i.e., that the strategy produces a model assignment
M for network N such that $M \notin R_N$. That means that there is an edge $(a,b) \in E$
such that $(M(a), M(b)) \notin R_t$, where $t := T(a,b)$. We distinguish these cases:


1. t was never executed. Then accumulatedChanges never contained any change
adjacent to a or b (Line 5). Since the initial changes were relative to a
consistent model assignment, we know that $(M(a), M(b)) \in R_t$.


2. t was executed and no other transformation adjacent to a or b was executed
afterwards. Then $(M(a), M(b)) \in R_t$ per definition.


3. t was executed and another transformation $t'$ adjacent to a or b was executed
afterwards. Because $t'$ was executed after t, t was in executed when $t'$ was the
candidate. So t’s last execution was in the recursion after $t'$’s first execution
in Line 6. Afterwards, $t'$ was only executed in Line 9. If $t'$ would have changed
M(a) or M(b), the strategy would have raised a failure. Hence, M(a) and
M(b) are the same as after the execution of t, and $(M(a), M(b)) \in R_t$.


All cases lead to a contradiction.


Theorem 6. The explanatory strategy is hippocratic.


Proof. The strategy only produces changes by executing syncx, which, per
definition, only generate changes if the models are not in their consistency relations.


Finally, we verify that we have indeed realised Principle 1 and that the 
strategy does not fail for a network N of only N-converging transformations. 


Theorem 7. The explanatory strategy ensures consistency among the
transformations that have already been executed before executing a transformation that
has not been executed yet (see Principle 1).


Proof. After the recursive call in Line 8, the current model assignment is consistent 
with all executed syncx (Theorem 5) and no changes to models adjacent to an 
executed syncx are allowed. 


Theorem 8. If the input network of the explanatory strategy consists only of
network-converging syncx, then the explanatory strategy does not fail.


Proof. First, we note that when calling the algorithm on a network with m
transformations, the first $m - 1$ iterations of the loop act identically to executing the
algorithm on a network without the last candidate. Second, we note that the
second part of the loop condition, “accumulatedChanges.adjacentTo(candidate)”
(Line 5), does not change the algorithm’s result apart from controlling the order
in which the syncx are executed. If any syncx was never executed because of 
this condition, then executing it would not have changed any model. Hence, we 
assume w.l.o.g. that all syncx in network will get executed. 

Now we show the following, stronger statement by induction over the number 
m of edges in network: “After running the explanatory strategy, the sequence 
of executed syncx contains each permutation of those syncx (not necessarily 
continuously)”. Since the transformations are network-converging and because 
of our first note above, proving this statement shows that the condition leading 
to a failure (Line 10) will never evaluate to true. The statement is trivially true 
for m = 1. Assume that the statement is true for all networks of size $1 \le n < m$ 
but not true for a network of size m. That means that after executing the last 
iteration of the loop, there is an order o of the m syncx in network in which they 
have not been executed yet. Let t be the candidate of the last iteration. Let j be 
the index of t in o. Per induction assumption, the order o[1]...o[j-1] has been 
executed in the previous iterations of the loop. Afterwards, t was executed in 
Line 6. Per induction assumption, the order o[j+1]...o[m] has been executed in 
the recursive call (Line 8) of the last iteration. This happened after Line 6. Hence, 
the transformations have been executed in the order o. This is a contradiction. 
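
The stronger statement used in this proof is a statement about subsequences: an execution sequence contains an order if that order can be read off from left to right while skipping other executions. The following Python sketch only illustrates this containment check; the function names and the example log are assumptions made for illustration and are not part of the strategy itself.

```python
from itertools import permutations


def contains_order(sequence, order):
    """True if `order` occurs in `sequence` as a (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    # `step in it` advances the iterator until it finds `step`,
    # so matches are found strictly from left to right.
    return all(step in it for step in order)


def contains_all_permutations(sequence, transformations):
    """True if every ordering of `transformations` is a subsequence of `sequence`."""
    return all(contains_order(sequence, p) for p in permutations(transformations))


# Hypothetical execution log over two transformations t1 and t2.
log = ["t1", "t2", "t1"]
print(contains_all_permutations(log, ["t1", "t2"]))
```

For the hypothetical log above, both orders (t1, t2) and (t2, t1) occur, whereas the shorter log t1, t2 would miss the order (t2, t1).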


The explanatory strategy only guarantees to produce a consistent model as- 
signment if all syncx are N-converging. We can, unfortunately, not provide an ap- 
proach to achieve N-convergence by construction or to determine N-convergence. 
We have, however, also discussed that every universal execution strategy needs to 
operate conservatively and thus fails in certain cases. Thus, even if a network N 
contains syncx that are not N-converging, the explanatory strategy still operates 
conservatively and at least fails based on the notion of a sensible and well-defined 
property. In addition, the exponential worst-case performance of the strategy is 
no limitation, because it only represents a bound to ensure termination. In 
cases in which the strategy terminates, we expect the repeated execution of each 
syncx to perform only a few changes in reaction to the changes made by other syncx, 
as otherwise they are unlikely to be N-converging. The interested reader can try 
out the explanatory strategy using the previously mentioned simulator [11]. 

In its current formulation, the explanatory strategy does not prevent the 
syncx from overwriting the initial user changes. This seems inappropriate, as 
user changes should usually not be reverted. Other authors address this issue by 
forbidding changes to models that have been edited by users [3, 30, 29], called 
“authoritative models”. There are, however, practical use cases where such changes 
should be allowed—the example in Section 4.1 is one of them. An option would 
be to let the strategy fail as soon as a syncx execution overwrites a user change. 


6 Conclusion 


In this paper, we have discussed influencing factors for designing a universal exe- 
cution strategy for model transformation networks. Such a strategy orchestrates 
transformations to create a consistent set of models. It involves determining 
an order to execute the transformations in, and a bound for the number of 
executions. We have proven that every universal execution strategy that always 
terminates needs to be conservative, i.e., it will fail for certain cases in which an 
execution order of transformations that yields a consistent solution exists. We 
have argued that providing explainability in cases where an execution strategy 
fails should be a central design goal. As a result, we have proposed the explanatory 
strategy, which is proven correct and terminates for every input. Additionally, it 
improves explainability of failures and has a well-defined bound for the number 
of transformation executions to ensure a reasonable level of conservativeness. 

We have formalised our findings on execution bounds and the behaviour of 
the proposed execution strategy to prove the insights and expected properties of 
the strategy. In consequence, this paper provides fundamental knowledge about 
the design space and relevant design goals of transformation network execution 
strategies. While the statements on correctness and well-definedness are proven, 
those on the usefulness of the strategy were derived by argumentation. To improve 
evidence of the results, the authors plan to apply the strategy to realistic use 
cases, involving larger networks of more complex transformations. 

Furthermore, the authors want to examine how the strategy can be further op- 
timised: It might, e.g., be improved by backtracking and trying further candidate 
transformations, or by selecting the next candidate more carefully. Since early 
executed transformations will be executed most often, starting with those that 
are least likely to cause conflicts might be beneficial. Finally, this paper assumes 
transformations to be binary. Since the presented strategy does not require this, 
future research could investigate transferability to multiary transformations. 


Abstract. Software verification has recently made enormous progress 
due to the development of novel verification methods and the speed-up 
of supporting technologies like SMT solving. To keep software verifi- 
cation tools up to date with these advances, tool developers keep on 
integrating newly designed methods into their tools, almost exclusively 
by re-implementing the method within their own framework. While this 
allows for a conceptual re-use of methods, it nevertheless requires novel 
implementations for every new technique. 

In this paper, we employ cooperative verification in order to avoid re- 
implementation and enable usage of novel tools as black-box components 
in verification. Specifically, cooperation is employed for the core ingre- 
dient of software verification which is invariant generation. Finding an 
adequate loop invariant is key to the success of a verification run. Our 
framework named CoVEGI allows a master verification tool to delegate 
the task of invariant generation to one or several specialized helper in- 
variant generators. Their results are then utilized within the verification 
run of the master verifier, allowing in particular for crosschecking the va- 
lidity of the invariant. We experimentally evaluate our framework on an 
instance with two masters and three different invariant generators using 
a number of benchmarks from SV-COMP 2020. The experiments show 
that the use of CoVEGI can increase the number of correctly verified 
tasks without increasing the used resources. 


Keywords: Cooperation, Software Verification, Invariant Generation 


1 Introduction 


Recent years have seen major progress in software verification, as for instance 
witnessed by the annual competition on software verification SV-COMP [2]. This 
success is on the one hand due to advances in SAT and SMT solving and on the 
other hand due to novel verification methods like interpolation in model check- 
ing [36], automata-based software verification [31] or property directed reacha- 
bility [16]. Still, automatic verification remains a complex and error-prone task. 
In particular, it is often the case that one tool can verify a particular class 
of programs, but fails to verify other classes (or even gives incorrect answers), 
whereas it is the reverse situation for another tool. Moreover, to keep their tools 
up to date with novel techniques, tool developers keep on integrating them by 
re-implementation within their framework. 


An approach for changing this unsatisfactory situation is cooperative veri- 
fication (for an overview see [13]). Cooperative verification builds on the idea 
of letting tools (and thus techniques) cooperate on verification tasks, thereby 
leveraging the tools' individual strengths. In particular, cooperative verification 
aims at black box combinations of tools, using existing tools off-the-shelf without 
re-implementation. While this sounds like a natural idea, its realization poses a 
number of challenges, the major one being the exchange and usage of analysis in- 
formation. For cooperation, tools are required to produce (partial) results which 
other tools can understand and employ in their verification run. With conditional 
model checking [7], the first proposal of an exchange format for verification re- 
sults was made. A conditional model checker outputs its (potentially partial) 
result in the form of a condition which can be read by other conditional model 
checkers in order to complete the verification task. Since verification tools nor- 
mally do not understand conditions, reducers [23,9] have been proposed to bring 
conditions back into a form understandable by verifiers, namely into (residual) 
programs describing the so far unverified program part. This allows the result 
of a conditional model checker to be made usable by arbitrary other verifiers. 
A second type of existing result usage is the validation of tools' results [4,34], 
similar to proof-carrying code [37]. Both of these types are sequential forms 
of cooperation: a first verifier starts and a second verifier continues, either by 
completing or by validating a first result. 


In this paper, we propose CoVEGI, a cooperation framework which comple- 
ments these existing approaches by a new type of cooperation. Conceptually, 
this framework (depicted in Figure 1) consists of a master verifier and a number 
of helper invariant generators. The master verifier has the overall control on the 
verification process and can delegate tasks to helpers as well as continue its own 
verification process with (partial) results provided by helpers. The helpers run 
in parallel as black boxes without cooperation. The task to be delegated is an in- 
tegral part of software verification, namely invariant generation. The framework 
allows cooperation via outsourcing the task of invariant generation, leveraging 
the strength of specialized invariant generation tools. 

Like for other types of cooperation, the question of the exchange format for 
results comes up. Here, we have chosen correctness witnesses [3] for this purpose. 
Correctness witnesses are employed in witness validation and certify a verifier’s 
result stating the correctness of a program. These witnesses are particularly well 
suited for our intended usage, because their format is standardized and a number 
of verifiers already produce correctness witnesses. To account for the incorporation 
of helper verifiers not producing witnesses, our framework also foresees the 
inclusion of adapters transforming invariants into correctness witnesses. We provide 
an implementation of two such adapters. Witnesses are then injected into 
the verification run of the master. For stating the task to be solved by invariant 
generators, we furthermore require mappers transforming program and property 
to be proven into a task format understandable by the helper tools. Figure 1 
depicts our framework for cooperative verification via externally generated invariants. 
The framework can be arbitrarily configured with different masters and 
helpers, provided that suitable adapters and mappers are given. 

Fig. 1: Cooperative verification via externally generated invariants 

We have implemented CoVEGI within the CPACHECKER framework [10] 
and have employed different configurations of it as master verifier. As helpers 
we have chosen publicly available verification tools, some producing and one 
not producing witnesses. We have then experimentally evaluated 14 different 
combinations of master and helper on benchmarks of the annual competition 
of software verification SV-COMP [2]. The experiments show an improvement 
over the verification capabilities of the master tool, without incurring significant 
overhead. In some cases, the verification time is even decreased in cooperative 
verification. 


Summarizing, we make the following contributions. 


– We propose a framework for cooperative software verification based on a 
  master-helper architecture using externally generated invariants. 

– We construct 14 different instantiations of the framework using 2 masters 
  and 3 helpers, running both helpers in isolation as well as in parallel. 

– For the inclusion of helper verifiers, we implement two adapters, one transforming 
  invariants expressed in the LLVM IR language¹ into correctness 
  witnesses, the other bringing a generated witness into the right format. 

– We carry out an extensive experimental evaluation demonstrating the effectiveness 
  and efficiency of collective invariant generation. 


2 Fundamentals 


We aim at the cooperative verification of programs written in GNU C, focusing 
on the validation of safety properties. To be able to define safety properties, a 
formal representation of programs as well as their semantics is needed. Thus we 
briefly introduce the syntax and semantics of programs which we consider here. 

1 https://llvm.org/docs/LangRef.html 

We follow the notation of Beyer et al. [6] describing programs as control-flow 
automata (CFAs). A CFA is basically a control-flow graph with edges annotated 
with program statements. More formally, a program is represented as a control-flow 
automaton $C = (L, \ell_0, G)$, consisting of a set of program locations $L$, an 
initial location $\ell_0 \in L$ and the control-flow edges $G \subseteq L \times Op \times L$. The set 
$Op$ contains all possible operations on integer variables² present in the program, 
namely conditions (as of conditionals and loops), assignments, method calls and 
return statements. Figure 2(a) shows a C-program taken from the SV-COMP 
benchmarks³, and Figure 2(b) its corresponding CFA. The program also contains 
a special error label, used for encoding the property to be verified. The 
verification task for this program is to show the non-reachability of the error 
label at location 9, i.e., for our example program the verifier has to prove that 
y equals n after the loop which is true (since n is unsigned). 

For the semantics, we start by defining program states. Let $Var$ denote the 
set of all integer variables occurring in programs, $BExp$ the set of boolean 
expressions and $AExp$ the set of arithmetic expressions over $Var$. Then a state $\sigma$ of 
the program is a mapping from the variables to the integers, i.e., $\sigma : Var \to \mathbb{Z}$. 
We lift the mapping to also contain the evaluation of arithmetic and boolean 
expressions so that $\sigma$ maps $AExp$ to $\mathbb{Z}$ and $BExp$ to $\mathbb{B}$. A finite program path $\pi$ is 
a sequence of transitions 
$(\sigma_0, \ell_0) \xrightarrow{g_0} (\sigma_1, \ell_1) \xrightarrow{g_1} \cdots \xrightarrow{g_{n-1}} (\sigma_n, \ell_n)$, 
such that $\sigma_0$ assigns 0 to all variables, $\ell_n$ is a leaf in the CFA and 
$(\ell_i, g_i, \ell_{i+1}) \in G$ holds for each transition 
$(\sigma_i, \ell_i) \xrightarrow{g_i} (\sigma_{i+1}, \ell_{i+1})$ in $\pi$. Infinite program paths are defined 
analogously. As for state changes in paths: If $g_i$ is a boolean expression, method call 
or return statement, then $\sigma_i = \sigma_{i+1}$ holds. If $g_i$ is an assignment $x = a$, where 
$a \in AExp$, then $\sigma_{i+1} = \sigma_i[x \mapsto \sigma_i(a)]$. Finally, we denote all paths of a program 
represented by a CFA $C$ by $paths(C)$. 
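
To make the path semantics concrete, the following Python sketch encodes the loop and the property check of the example program as CFA edges and follows one path. It is only an illustration under assumptions: the location numbers are taken from the line numbers of the listing, the encoding of operations as `("assume", ...)` and `("assign", ...)` tuples is hypothetical, and the path starts at the loop head with a given state instead of modelling the nondeterministic initialisation.

```python
def make_cfa():
    """Return (initial location, edges) for the loop and the check of the example."""
    # An operation is ("assume", predicate) or ("assign", variable, expression).
    edges = [
        (4, ("assume", lambda s: s["x"] > 0), 5),
        (5, ("assign", "x", lambda s: s["x"] - 1), 6),
        (6, ("assign", "y", lambda s: s["y"] + 1), 4),
        (4, ("assume", lambda s: not s["x"] > 0), 8),
        (8, ("assume", lambda s: not s["n"] == s["y"]), 9),  # error location
        (8, ("assume", lambda s: s["n"] == s["y"]), 10),
    ]
    return 4, edges


def run_path(l0, edges, state):
    """Follow enabled edges from l0 until a leaf; return the path as [(state, location)]."""
    path = [(dict(state), l0)]
    loc = l0
    while True:
        for src, op, dst in edges:
            if src != loc:
                continue
            if op[0] == "assume" and op[1](state):
                break  # guard holds, the state is unchanged
            if op[0] == "assign":
                state = {**state, op[1]: op[2](state)}  # sigma[x -> sigma(a)]
                break
        else:
            return path  # no enabled edge: a leaf has been reached
        loc = dst
        path.append((dict(state), loc))


l0, edges = make_cfa()
for sigma, location in run_path(l0, edges, {"n": 2, "x": 2, "y": 0}):
    print(location, sigma)
```

Conditions leave the state untouched and assignments update exactly one variable, which mirrors the two cases for state changes given above.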


Here, we are interested in verifying safety properties of programs given as 
CFAs. For the purpose of this paper, we define a safety property P as a pair of a 
location $\ell \in L$ and a boolean condition $\varphi \in BExp$. There can be multiple safety 
properties required to hold in a program. For our example program of Figure 2 
the property is (8,n = y). For the verifier this is encoded in the form 


8: if (!(n==y)) 
9: Error: return 1; 


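A CFA (or program) $C$ violates a safety property $P = (\ell, \varphi)$ when the program 
reaches location $\ell$ in a state which does not satisfy $\varphi$. More formally, $P$ is 
violated by $C$ if there is some path $\pi \in paths(C)$, 
$\pi = (\sigma_0, \ell_0) \xrightarrow{g_0} (\sigma_1, \ell_1) \xrightarrow{g_1} \cdots \xrightarrow{g_{n-1}} (\sigma_n, \ell_n)$, 
and some $i$, $0 \le i \le n$, such that $\ell_i = \ell$ and $\sigma_i(\varphi) = false$. 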
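
The violation condition can be read as a simple check over one path. The Python sketch below is only an illustration: a path is assumed to be given as a list of (state, location) pairs, and the two hand-written paths are hypothetical (the second one is not a path of the example program and only shows what a violation would look like).

```python
def violates(path, prop):
    """A path violates P = (location, phi) if it visits the location in a state where phi is false."""
    location, phi = prop
    return any(l == location and not phi(state) for state, l in path)


# Safety property (8, n = y) of the running example.
prop = (8, lambda s: s["n"] == s["y"])

# Hypothetical paths given as (state, location) pairs.
good = [({"n": 1, "x": 1, "y": 0}, 4), ({"n": 1, "x": 0, "y": 1}, 8)]
bad = [({"n": 1, "x": 1, "y": 0}, 4), ({"n": 1, "x": 0, "y": 0}, 8)]

print(violates(good, prop), violates(bad, prop))
```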


2 In our formalization, we use integer variables only; the implementation covers C 
programs. 
3 https://github.com/sosy-lab/sv-benchmarks 

 1 int main() { 
 2   unsigned int n = nondet(); 
 3   unsigned int x = n, y = 0; 
 4   while (x > 0) { 
 5     x--; 
 6     y++; } 
 7   // Safety property 
 8   if (!(n==y)) { 
 9     Error: return 1; } 
10   return 0; 
11 } 

(a) C code example 

(b) The corresponding CFA (figure not reproduced) 

(c) Part of the witness (figure not reproduced) 


Fig. 2: An example program, its control flow automaton and one witness 


Cooperatively verifying safety of programs is achieved in our framework via 
external (loop) invariant generation. Syntactically, a loop invariant is a boolean 
expression associated to a loop head. A loop invariant needs to hold (1) before the 
first loop execution and (2) after each loop execution. The expression n = x+y, 
for instance, is a loop invariant for the program in Figure 2(a), associated to 
the loop head at location 4. This loop invariant facilitates verification, because 
in conjunction with the negated loop condition and information about initial 
variable values it ensures n to be equal to y after the loop. Other valid loop 
invariants would be $x \ge 0$ or $n = 3 \Rightarrow y < 5$, which, however, do not help in 
proving the safety property. Especially the loop invariant true does not provide 
any information. Thus, we call it a trivial invariant. 
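
The two conditions for a loop invariant can be tested on concrete executions of the example loop. The following Python sketch is a bounded check and not a proof: it runs the loop for a few initial values of n (the range is an arbitrary assumption) and evaluates a candidate before the first and after every iteration.

```python
def holds_at_loop_head(invariant, n_values=range(0, 20)):
    """Evaluate `invariant` before the first and after each loop iteration
    for every initial value of n in `n_values`."""
    for n in n_values:
        x, y = n, 0
        if not invariant(n, x, y):      # (1) before the first loop execution
            return False
        while x > 0:
            x, y = x - 1, y + 1
            if not invariant(n, x, y):  # (2) after each loop execution
                return False
    return True


print(holds_at_loop_head(lambda n, x, y: n == x + y))  # helpful invariant
print(holds_at_loop_head(lambda n, x, y: True))        # trivial invariant
print(holds_at_loop_head(lambda n, x, y: y < n))       # no invariant: fails for n = 0
```

Together with the negated loop condition x = 0, the first candidate yields n = y after the loop, which is exactly the safety property.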

As stated before, we chose witnesses (more specifically, correctness witnesses) 
as exchange format during collective invariant generation. Formally, a witness is 
a finite state automaton in which transitions are labelled with so called source 
code guards and states can be equipped with boolean expressions. When all 
these boolean expressions are either true or false, we call the witness trivial. 
Source code guards are of the form location,type where type can be then, 
else, enterFunc and enterLoopHead. The guard o/w (otherwise) is used if a 
source code line does not match the other guards present. Via these labels we 
can match transitions of the automaton with edges in the CFA. Syntactically, 
correctness witnesses are stored in an XML format and consist of two parts: 
(1) general information like the program associated with the witness, and (2) 
a GraphML representation of the witness automaton. More information and a 
formal specification of correctness witnesses can be found in [3]. 

In Figure 2(c), we see a correctness witness for our example program. State 
q3 is reached by transitions labelled 3, enterLoopHead or 6,enterLoopHead and 
thus corresponds to the loop head at program location 4. Associated with this 
state is the invariant n = x + y. 
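
A witness automaton of this kind can be sketched as a small data structure. The Python code below is an illustration only and does not read or write the GraphML exchange format: the class and method names are hypothetical, the state names other than q3 are assumptions, and the o/w guard is modelled as staying in the current state.

```python
from dataclasses import dataclass, field


@dataclass
class Witness:
    """Witness automaton: guarded transitions plus optional state invariants."""
    transitions: dict = field(default_factory=dict)  # (state, (line, type)) -> state
    invariants: dict = field(default_factory=dict)   # state -> boolean expression

    def step(self, state, line, kind):
        # Follow a matching source-code guard; otherwise (o/w) stay in the state.
        return self.transitions.get((state, (line, kind)), state)

    def is_trivial(self):
        return all(inv in ("true", "false") for inv in self.invariants.values())


w = Witness(
    transitions={
        ("q1", (1, "enterFunc")): "q2",
        ("q2", (3, "enterLoopHead")): "q3",
        ("q3", (4, "then")): "q4",
        ("q4", (6, "enterLoopHead")): "q3",
    },
    invariants={"q3": "n == x + y"},
)

state = "q1"
for guard in [(1, "enterFunc"), (2, "then"), (3, "enterLoopHead")]:
    state = w.step(state, *guard)
print(state, w.invariants.get(state), w.is_trivial())
```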


3 Concept 


In this section, we introduce our novel concept of Cooperative Verification via 
Externally Generated Invariants (CoVEGI), shown in Figure 1. The framework 
contains two sorts of main components: Master verifiers (one) and helper invari- 
ant generators (several). Next, we state some requirements on and explain the 
functionality of these components as well as their cooperation. 


3.1 Components of the CoVEGI-Framework 


The most important component of the framework is the master verifier, which 
we build out of an existing verifier. The master is responsible for coordinating 
the verification process and can, if needed, request support from the second type 
of components, the helpers, in the form of invariants as described by correctness 
witnesses. Hence, the master is also steering the cooperation. 

In the following, we explain the two sorts of main components in more detail: 


Master Verifier A master verifier gets as input the program C as CFA and a 
safety property P. It computes as output a boolean answer b, stating whether 
the property holds, and possibly (but not necessarily) provides an overall 
witness w. To be able to process the provided support in form of invariants 
stored inside of correctness witnesses, a master is required to implement an 
internal function called injectWitness. The function loads a witness, extracts 
the invariants present in it and injects them into the analysis of the master 
verifier. The witness injection can either happen before (re-)starting the 
analysis or during runtime. 

Helper Invariant Generator A helper invariant generator gets as input the 
program C as CFA and a safety property P. It computes as output a set of 
invariants, stored in a verification witness w’. The generated invariants are 
neither required to be helpful for the master verifier nor to be correct. Thus, 
helper invariant generators are also allowed to generate trivial invariants or 
invariant candidates which might turn out to be wrong. 


Table 1: Overview of the configuration options available 

Name               Description                                          Values 
restartMaster      restart the master after invariant generation        boolean 
termAfterFirstInv  use first witness only                               boolean 
timerM             max. time for master until requestsForHelp is sent   time (s) 
timeoutH           max. time for helpers to generate an invariant       time (s) 


We can neither expect existing verification tools which we wish to use as helpers 
to be able to work on CFAs, nor to understand the safety property or to produce 
witnesses. Hence, we foresee two further sorts of components in our framework: 


Mapper A mapper transforms the safety property specification inside the program 
into the desired input format of the helper. A mapper basically conducts 
some simple syntactic code replacements. For instance, for our running 
example some helpers might instead require the safety property to be written 
as assert(n==y); or as if(!(n==y)) {verifier_error();}. 

Adapter An adapter generates a correctness witness out of the computed loop 
invariants of a helper. Furthermore, some helper invariant generators work 
on intermediate representations (IR) of the C-language (e.g. LLVM) or intermediate 
verification languages (e.g. Boogie). Then, the computed invariants 
(formulated in terms of IR-variables) first of all need to be translated back 
to the namespace of the C-program. An adapter for LLVM is explained in 
more detail in Section 3.4. 
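
The kind of syntactic replacement a mapper performs can be sketched with a regular expression. The Python code below is a hypothetical illustration, not the mapper used in the framework: the pattern only covers the error-label encoding of the running example, and it assumes that the condition itself contains no parentheses.

```python
import re

# Matches `if (!(COND)) { Error: return 1; }`; COND must not contain
# parentheses in this simplified pattern.
ERROR_LABEL = re.compile(
    r"if\s*\(\s*!\s*\(\s*([^()]+?)\s*\)\s*\)\s*\{\s*Error\s*:\s*return\s+1\s*;\s*\}"
)


def map_to_assert(source: str) -> str:
    """Rewrite the error-label encoding of a safety property into an assert statement."""
    return ERROR_LABEL.sub(lambda m: f"assert({m.group(1)});", source)


print(map_to_assert("if (!(n == y)) { Error: return 1; }"))
```

A mapper for a helper that expects a different encoding would use the same pattern with another replacement string.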


3.2 Cooperation within CoVEGI 


After having explained the individual components, we define their interaction 
in the framework. In this paper, we focus on the parallel execution of several 
helpers which implement complementary approaches so that we can leverage 
their individual strengths. Algorithm 1 describes the form of cooperation. It 
is steered by several user configurable options which fix aspects like time and 
resource limits of master and helpers. Table 1 summarizes the configuration 
options. We next describe them in detail. 


Master options The following aspects of the master’s behavior need to be 
fixed: First, when to delegate tasks to helpers, and second, how to continue 
the verification process after invariant generation. For the delegation, we let 
the master verifier run until it requests support, which can be checked by in- 
specting the master’s flag requestsForHelp. The master gets a configurable 
time limit (called timerM) after which it is expected to send this request. 
By adding such an explicit request for help, we allow the master to send a 
request for other reasons (besides the timer) in the future. Then, after in- 
variant generation, the master can either be freshly restarted or continued 
(option restartMaster). 

Helper option When at least two helpers run in parallel, eventually one of 
them first computes a witness. We can then either (1) directly stop the 
other helpers, or (2) wait for all to complete before injecting witnesses into 
the master. This option is called termAfterFirstInv. 

Timeouts Finally, similar to the master, we can set a specific timeout for the 
helpers which fixes how long they are allowed to try to generate invariants. 
The timeout option is called timeoutH. 


Algorithm 1 CoVEGI-algorithm 

Input:  C        ▷ CFA 
        P        ▷ safety property 
        M        ▷ master 
        Helpers  ▷ set of helpers 
        conf     ▷ configuration 
Output: w        ▷ witness 
        b        ▷ result 

 1: M.start(C, P, conf.timerM); 
 2: wait until (M.requestsForHelp ∨ M.hasSolution()); 
 3: if (M.hasSolution()) then 
 4:     return M.getSolution(); 
 5: for each H ∈ Helpers do parallel                      ▷ run helpers in parallel 
 6:     H.start(C, P, conf.timeoutH); 
 7:     wait until (H.timedout() ∨ H.hasSolution() ∨ H.stopped()); 
 8:     if (H.hasSolution() ∧ nonTrivial(H.getSolution())) then 
 9:         witnesses := witnesses ∪ H.getSolution(); 
10:         if (conf.termAfterFirstInv) then 
11:             for each H′ ∈ Helpers \ {H} do parallel 
12:                 H′.stop();                            ▷ stop other helpers 
13: if (M.hasSolution()) then 
14:     return M.getSolution(); 
15: if (witnesses ≠ ∅) then                               ▷ invariants found 
16:     if (conf.restartMaster) then 
17:         M.stop(); 
18:     M.inject(witnesses);                              ▷ inject witnesses into master 
19:     if (conf.restartMaster) then 
20:         M.start(C, P, ∞); 
21: join(M);                                              ▷ wait for M to finish 
22: return M.getSolution(); 


Next, we explain the CoVEGI algorithm shown in Algorithm 1 in detail. We as- 
sume that master and helpers run as threads and can be started and stopped. We 
furthermore employ methods wait for waiting until some condition is achieved 
and join for waiting for a specific thread to complete. 

Initially, the master verifier is started without any helper invariant generators 
running in parallel (line 1), providing the opportunity to verify programs on its 
own. It runs standalone until it requests help (either due to not being able to 
solve the problem alone or due to hitting its timer) or until it computes a result 
which is subsequently returned (line 3). Afterwards all helpers are started in 
parallel (lines 5 and 6). They also run until they reach their timeout, a solution 
is found or they are stopped. Their solutions (invariants) are inserted into the 
witness set (line 9). Depending on option termAfterFirstInv, either all but the 
first finished helper are stopped or the algorithm waits until all helpers have either 
computed a solution or run into their timeout. If invariants (witnesses) have been computed, 
these are injected into the master (line 18). If the restartMaster option is set, 
the master needs to be stopped before injection and restarted afterwards. Then 
the master continues and completes its verification (without any further request 
for help) and the result is finally returned. 


Example 1. To explain the framework's functionality, we demonstrate the CoVEGI 
algorithm on the example presented in Figure 2(a). Assume that we instantiate 
the framework with a master verifier and four helper invariant generators 
that are used in parallel⁴. Moreover, we configure the framework as follows: We 
set restartMaster to true, termAfterFirstInv to false, timerM to 50 
seconds and timeoutH to 300 seconds. 

Initially, the master verifier runs standalone and after 50 seconds runtime it 
requests help. The four helper invariant generators are then called and run in 
parallel with the master. Let us assume that the first helper returns only trivial 
invariants (after 10s), the second one the invariant $n \ge y$ (after 50s), the third one the 
invariant $n = x + y$ (after 100s) and the fourth the invariant $n - x - y = 0$ (after 
500s). The trivial invariant is ignored (see check in line 8) and when the second 
helper returns a solution, the third and fourth helper are not stopped, due 
to the chosen configuration. The algorithm waits until the third helper computes 
the invariant and the fourth (only being able to compute an invariant after 500s) 
hits the timeout after 300s. Then the master is stopped, the invariants $n \ge y$ 
and $n = x + y$ are injected and the master is restarted. The master verifier can 
use both invariants and might now compute the correct result. 
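
The helper phase of this example can be replayed with a small simulation. The Python sketch below is an assumption-laden illustration and not the implementation of the framework: it replaces threads by simulated finishing times, models a trivial invariant by the string "true", and only covers the collection of witnesses (lines 5 to 12 of Algorithm 1); `Config` and `collect_witnesses` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Config:
    restartMaster: bool
    termAfterFirstInv: bool
    timerM: float    # seconds
    timeoutH: float  # seconds


def collect_witnesses(helpers, conf):
    """helpers: list of (finish_time_in_s, invariant).

    Returns the collected non-trivial invariants and the duration of the helper phase."""
    finished = sorted((t, inv) for t, inv in helpers if t <= conf.timeoutH)
    witnesses, last = [], 0.0
    for t, inv in finished:
        last = t
        if inv != "true":                 # non-triviality check (line 8)
            witnesses.append(inv)
            if conf.termAfterFirstInv:    # stop the other helpers (lines 10-12)
                return witnesses, t
    # Without early termination the phase lasts until every helper has
    # finished or has hit its timeout.
    timed_out = any(t > conf.timeoutH for t, _ in helpers)
    return witnesses, (conf.timeoutH if timed_out else last)


helpers = [(10, "true"), (50, "n >= y"), (100, "n == x + y"), (500, "n - x - y == 0")]
conf = Config(restartMaster=True, termAfterFirstInv=False, timerM=50, timeoutH=300)
print(collect_witnesses(helpers, conf))
```

With termAfterFirstInv set to true instead, the same helpers would yield only the second helper's invariant, and the helper phase would end after 50 seconds.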


3.3 Witness Injection 


As master verifiers need to offer witness injection, we explain a possible pro- 
cedure for predicate abstraction and k-induction, which are the two techniques 
we use as masters during the evaluation. For both, the invariants are extracted 
from the witness and then added to the analysis information already computed 
by the master verifier. Both analyses store their analysis information in an ab- 
stract reachability graph (ARG). Broadly speaking, an ARG is a CFA equipped 
with predicates. More formally, an ARG is a finite state automaton, where nodes, 
called abstract states, consist among others of analysis information (i.e. predi- 
cates) and program locations. Two nodes within an ARG are connected if their 
program locations are connected within the CFA. Note that a program location 
may occur in multiple abstract states, e.g. when the analysis unrolls a loop. 
Hence, witness injection has to update all the abstract states for whose program 
location the witness contains an invariant. 

4 In [29] it is shown that using more than two helpers does not make sense in practice. 

Fig. 3: Workflow of an adapter for a helper working on an IR 

Predicate Abstraction. We use a predicate abstraction technique [11], 
conducting predicate refinement using a CEGAR (counterexample-guided abstraction 
refinement) scheme [20] with lazy abstraction [33] and Craig interpolation [32]. 
Witness Injection: The predicate abstraction maintains, for each abstract state, 
one set of available predicates (called precision) and one set of valid predicates. 
Witness injection is realized by extracting all predicates and the corresponding 
locations from the invariants. If these predicates contain conjunctions of clauses, 
these are furthermore split up and inserted individually. Splitting predicates in- 
creases the performance due to the fact that SMT solvers perform better on 
many small predicates than on few larger ones⁵. These predicates are added to 
the precision of abstract states corresponding to the locations specified in the 
witness. Thereby, the predicates are used during the next abstraction performed 
by the analysis. The abstraction function itself guarantees that only predicates 
from the candidate set being valid at the current location are used. Thus, in- 
valid invariants are ignored. This procedure can also be used when restarting 
predicate abstraction, by adding the predicates from the witness to the initial 
precision of the abstract states corresponding to the locations specified in the 
witness (which is empty otherwise). 
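
The splitting and insertion step can be sketched as follows. This Python code is a simplified illustration under assumptions: invariants are plain strings using `&&` for conjunction, the precision is a dictionary from program locations to sets of predicates (the actual analysis keeps it per abstract state), and the validity filtering done by the abstraction function is not modelled.

```python
def split_conjuncts(expr):
    """Split a top-level conjunction `a && b && ...` into its conjuncts."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and expr.startswith("&&", i):
            parts.append(expr[start:i].strip())
            start = i + 2
            i += 1  # skip the second '&'
        i += 1
    parts.append(expr[start:].strip())
    return parts


def inject_witness(precision, witness_invariants):
    """Add the split predicates to the precision of the matching locations."""
    for location, invariant in witness_invariants.items():
        precision.setdefault(location, set()).update(split_conjuncts(invariant))
    return precision


precision = {4: {"y >= 0"}}
print(inject_witness(precision, {4: "n == x + y && x >= 0"}))
```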

k-Induction. The basic idea of k-induction [25] is to generalize bounded
model checking (BMC) [14] via induction. After BMC has proven all k-bounded
program executions safe, the analysis aims for a generalization. To this end, it
generates auxiliary invariants that are continuously refined using a CEGAR-based
analysis [5]. These invariants are combined with the information generated by BMC
and generalized to a safety proof by successfully conducting an induction step.
Witness Injection: For both cases, adding invariants to a running analysis and
adding them before a restart, we use the same idea: whenever a witness is
made available to the analysis, the encoded predicates and program locations
are added as candidates to the set of auxiliary invariants generated by
the analysis. New elements in this set are periodically checked for validity by
k-induction. Valid externally generated invariants are then conjoined with the
predicates stored in the analysis' abstract states corresponding to the invariant's
location. Invalid invariants are thus ignored.
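To illustrate how candidate invariants can be filtered by k-induction, the following sketch checks candidates on a small explicit-state transition system. It is an assumption-laden toy: real tools discharge the base case and induction step with an SMT solver on program encodings, whereas this code enumerates states, and all names (`k_inductive`, `check_candidates`, the example system) are hypothetical.

```python
def k_inductive(states, init, step, prop, k):
    """Return True if `prop` is k-inductive on a finite transition system.

    states: all states; init: initial states;
    step: function mapping a state to its successor states;
    prop: candidate invariant as a predicate on states.
    """
    # Base case: prop holds on every state reachable in fewer than k steps.
    frontier = set(init)
    for _ in range(k):
        if not all(prop(s) for s in frontier):
            return False
        frontier = {t for s in frontier for t in step(s)}
    # Induction step: any k consecutive prop-states (reachable or not)
    # are followed only by prop-states.
    paths = {(s,) for s in states if prop(s)}
    for _ in range(k - 1):
        paths = {p + (t,) for p in paths for t in step(p[-1]) if prop(t)}
    return all(prop(t) for p in paths for t in step(p[-1]))


def check_candidates(candidates, states, init, step, max_k):
    """Map each valid candidate to the smallest k proving it; drop the rest."""
    valid = {}
    for name, prop in candidates.items():
        for k in range(1, max_k + 1):
            if k_inductive(states, init, step, prop, k):
                valid[name] = k
                break
    return valid


# Toy system: a counter that starts at 0 and adds 2 modulo 8.
states = range(8)
init = {0}
step = lambda s: [(s + 2) % 8]

candidates = {
    "even": lambda s: s % 2 == 0,  # valid, 1-inductive
    "not3": lambda s: s != 3,      # valid, but needs k = 4
    "lt6": lambda s: s < 6,        # invalid: state 6 is reachable
}
valid = check_candidates(candidates, states, init, step, max_k=5)
```

Only the candidates in `valid` would be conjoined with the analysis' own invariants; the invalid candidate is silently dropped, mirroring the behavior described above.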


3.4 Adapter for LLVM-based Helper Invariant Generators 


⁵ This has been reported by tool developers and has also been shown in our experiments.

Next, we exemplify an adapter for helper invariant generators working on LLVM,
following the general construction depicted in Figure 3. Often, tools associate
invariants with LLVM basic blocks. A basic block is a code fragment having a single
entry location (the first) and a single exit location (in general the last location of
the block). To construct a witness containing the invariants, we need to translate
them and find the matching C-code location for the basic block. For both, we
use the LLVM-IR equipped with debug information, obtained by running the
compiler with the launch parameter -g. Thereby, we obtain the IR-code fragment
of the program in Figure 2(a), shown in simplified form and containing the most
important debug information as comments. The example contains two basic blocks,
entry and _bb.


    entry:
      v1 = bitcast i32 (...)* @nondet to i32 ()*      ; n
      v2 = icmp eq i32 v1, 0
      br i1 v2, label %error, label %_bb

    _bb:
      v3 = phi i32 [0, %entry], [v6, %_bb]            ; y
      v4 = phi i32 [v1, %entry], [v5, %_bb]           ; x
      v5 = add i32 v4, -1
      v6 = add i32 v3, 1
      v7 = icmp eq i32 v5, 0
      br i1 v7, label %error, label %_bb              ; line 4


The helper invariant generator computes the invariant v1 - v4 - v3 = 0 for
the example and associates it with the basic block _bb. First, we need to
transform the IR variables into the C variables occurring in the program.
In this example, we can use the debug information shown in the comments of
the code. In general, a more sophisticated procedure is needed since LLVM-IR
uses three-address code: complex expressions are split into several
statements using intermediate variables, which have to be resolved to C expressions.
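For the simple case where debug information names a C variable for every IR variable, the translation amounts to a substitution. The sketch below assumes that the debug comments map v1, v3 and v4 to the C variables n, y and x; this mapping, the string representation of invariants, and the function name `translate_invariant` are illustrative assumptions rather than the adapter's actual code.

```python
import re


def translate_invariant(invariant, debug_map):
    """Replace IR variables (v1, v2, ...) by the C variables recorded in
    the debug information; fail loudly if a variable is unmapped."""

    def replace(match):
        name = match.group(0)
        if name not in debug_map:
            raise KeyError(f"no C variable known for IR variable {name}")
        return debug_map[name]

    return re.sub(r"\bv\d+\b", replace, invariant)


# Assumed mapping taken from the debug comments of the example.
debug_map = {"v1": "n", "v3": "y", "v4": "x"}
c_invariant = translate_invariant("v1 - v4 - v3 == 0", debug_map)
```

Intermediate variables such as v5 or v6 have no entry in the map, which is exactly the case where the more sophisticated resolution mentioned above is required.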

Afterwards, the transformed invariant needs to be associated with the correct 
location in the C-code. We analyze the LLVM IR program structure to map the 
basic blocks back to C-locations. In the example, the block _bb is identified as 
being the loop of the program, thus the invariant is mapped to the loop head. 
For this, we employed some basic functions provided by PHASAR [41] in our 
adapter. Finally, we construct the CFA of the C-program, store the invariants 
at the nodes and convert the equipped CFA to a verification witness. 


4 Evaluation 


In the following, we evaluate different instantiations of CoVEGI. We focus on 
both effectiveness and efficiency, generally aiming at checking whether the use 
of CoVEGI can increase the number of correctly solved verification tasks within 
the same resource limits. A more detailed evaluation of CoVEGI can be found 
in an extended pre-print [29]. 


4.1 Research Questions 


In the evaluation, we were interested in the following three research questions. 
Table 2: Summary of tools used as helpers


Tool               Techniques                                               Mapper  Adapter
SeaHorn            generation and solving of constrained Horn clauses       ✗       ✓
UltimateAutomizer  predicate abstraction, automata, path-based refinement   ✗       (✓)
VeriAbs            portfolio of 4 different sequential compositions         ✗       ✗


RQ1. Can collective invariant generation increase the effectiveness of the master
verifier? Evaluation plan: We let the framework run with a single invariant
generator and compare the results to a standalone run of the master verifier.

RQ2. Does cooperation impact the overall efficiency of the verification?
Evaluation plan: We compare the run time of CoVEGI with one helper against
the two master verifiers running standalone.

RQ3. Does it pay off to run two invariant generators in parallel? Evaluation
plan: We let the framework run with two invariant generators and compare
the results to a run in which only a single invariant generator is used.


4.2 Experimental Setup 


Tools. To evaluate the performance of our framework CoVEGI, we
instantiated it with predicate abstraction and k-induction as master verifiers and
with three helpers, using existing off-the-shelf invariant generation tools. We based
the implementation of our CoVEGI algorithm on CPACHECKER⁶ 1.9.1. To the
best of our knowledge, there are no standalone, publicly available invariant
generators that generate invariants for both global and local variables without
doing a full verification. To be able to evaluate CoVEGI, we therefore decided to
use off-the-shelf verifiers as invariant generators instead, using only the invariants
they generate. We thus looked at current and past participants of the annual
competition on software verification SV-COMP [2]. We chose
the tools SEAHORN [28], ULTIMATEAUTOMIZER [30] and VERIABS [1]. Both
ULTIMATEAUTOMIZER and VERIABS achieved excellent results in this year's
SV-COMP, which is the reason we chose them. As third tool we use SEAHORN, a
verification tool that neither currently participates in SV-COMP nor produces
witnesses. It operates on the LLVM intermediate representation; therefore, we
used the adapter exemplified in Section 3.4. The three helper invariant generators
are used as black boxes and employ verification techniques complementary
to those of both the other helpers and the two masters. An overview of the
techniques employed in these tools is given in Table 2. The table also states
whether the helpers require mappers and adapters. For VERIABS and
ULTIMATEAUTOMIZER we used the versions used in SV-COMP 2020⁷. Since
there is no precompiled binary of SEAHORN, we employ the docker
container of the latest version⁸. All three helper invariant generators are used in
their default configuration.


⁶ https://github.com/sosy-lab/cpachecker, Revision (8646a85)
⁷ https://gitlab.com/sosy-lab/sv-comp/archives-2020/tree/master/2020

Table 3: Comparison of the two master verifiers running standalone and using a
single helper.


Tool               k-induction                predicate abstr.
Combination        alone  +SH   +UA   +VA     alone  +SH   +UA   +VA
correct overall    146    148   158   163     116    122   132   125
correct true       102    104   114   119     78     84    94    87
correct false      44     44    44    44      38     38    38    38
additional true    -      +3    +13   +19     -      +6    +16   +9
additional false   -      0     0     0       -      0     0     0
uniquely solved    1      0     8     15      0      0     6     3


During evaluation, we used the following default configuration for our own
framework: we set termAfterFirstInv and restartMaster to true, timerM
to 50 s⁹ and timeoutH to 300 s. In the following, we use the abbreviations
SH for SEAHORN, UA for ULTIMATEAUTOMIZER and VA for VERIABS.
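The default configuration can be written down as a small record. This is only a sketch: the class `CoVEGIConfig` and its field names are hypothetical, and the comments paraphrase what the parameter names and the evaluation text suggest rather than quoting a specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoVEGIConfig:
    # termAfterFirstInv: presumably, stop the helpers once a first
    # invariant has been delivered (assumption based on the name).
    term_after_first_inv: bool = True
    # restartMaster: presumably, restart the master with the injected
    # invariants instead of continuing the running analysis.
    restart_master: bool = True
    # timerM: time in seconds the master runs alone before asking for help.
    timer_m_seconds: int = 50
    # timeoutH: time limit in seconds for the helpers.
    timeout_h_seconds: int = 300


default_config = CoVEGIConfig()
```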

Verification Tasks. The verification tasks are taken from the set of
SV-COMP 2020 benchmarks¹⁰. As we are interested in finding suitable loop
invariants, we selected all tasks from the category ReachSafety-Loops. To obtain
a broader distribution of tasks, we randomly selected 55 additional tasks
from the categories ProductLines, Recursive, Sequentialized, ECA, Floats and
Heap, yielding 342 tasks in total.

Computing Resources. We conducted the evaluation on three virtual
machines, each having an Intel Xeon E5-2695 v4 CPU with eight cores and a
frequency of 2.10 GHz and 16 GB of memory, running Ubuntu 18.04 LTS with
Linux kernel 4.15. We ran our experiments using the same setting as in
SV-COMP, giving each task 15 minutes of CPU time on 8 cores and 15 GB of
memory. We employed BENCHEXEC to guarantee these resource limitations [12].

Availability. Our tool and all experimental data are available¹¹.


4.3 Experimental Results 


We implemented the CoVEGI framework as a proof of concept in the
CPACHECKER framework. For this, we had to extend the existing implementations
of k-induction and predicate abstraction with witness injection. For the helper
invariant generators we did not change a single line of code, only adding adapters
if needed. Integrating helpers like VERIABS, which require neither an adapter nor
a mapper, can be done within a few lines of code. Although the implementation is
a proof of concept, this shows that the presented framework works in practice
and is applicable to all kinds of off-the-shelf helper invariant generators, those
producing verification witnesses as well as those generating invariants in IR.


⁸ suggested by the developers; used docker seahorn/seahorn-llvm5 (4c01c1d)
⁹ Which has turned out to be a preferable value, as we explain in [29]
¹⁰ https://github.com/sosy-lab/sv-benchmarks/releases/tag/svcomp20
¹¹ https://covercig.github.io/covegi/

RQ1 (Effectiveness). To evaluate whether a master verifier benefits from 
the support of a helper, we execute a combination of a master and a helper 
in the default configuration and compare it to the master running standalone. 
Here, we are interested in the number of correct verification results, i.e., the 
verifier correctly reporting the safety property to be fulfilled (result true) or not 
(result false). Running standalone, k-induction can correctly solve 146 of the 
verification tasks, predicate abstraction 116. 

Table 3 gives the results of this experiment. The table lists the overall
number of correct results, the numbers of correct true and correct false results,
the number of tasks additionally solved when using a helper, and the number of
tasks uniquely solved by the configuration. Cooperative invariant generation
increases the performance of both masters. As expected, this applies only to
verification tasks with a fulfilled safety property, i.e., the invariant generators
can help in proving that a property holds, but cannot help in refuting properties
(as they correctly do not generate invariants in these cases). Besides the
additionally solved tasks, there are also tasks that can no longer be solved
correctly: one each for SH and UA, and two for VA. In these cases, the master
consumes most of the available CPU time on its own, hence sharing resources
with the helpers results in a timeout.


On our data set, the total number of correctly solved tasks using CoVEGI
increases by 12% for k-induction and 14% for predicate abstraction as master.
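These percentages can be reproduced from Table 3 under the assumption that each master is compared against its best single helper (VA for k-induction, UA for predicate abstraction). The snippet below is only a sanity check of that reading; the variable names are hypothetical.

```python
# Correct results from Table 3: master alone vs. best single helper.
standalone = {"k-induction": 146, "predicate abstraction": 116}
best_single = {"k-induction": 163, "predicate abstraction": 132}  # +VA, +UA


def relative_increase(before, after):
    """Relative increase in percent."""
    return 100.0 * (after - before) / before


increase = {
    master: round(relative_increase(standalone[master], best_single[master]))
    for master in standalone
}
```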


RQ2 (Efficiency). Next, we evaluate the efficiency of CoVEGI, analyzing
the CPU time spent solving the verification tasks. As CoVEGI eventually shares
the CPU time between master and helpers, we expect that more time is needed
to compute a correct result after the helper is started.

Figure 4 shows two quantile plots of the verification runs, 4(a) with k- 
induction and 4(b) with predicate abstraction as master. A datapoint (x,y) 
in the plot means that the verifier computes the x-fastest correct results in at 
most y seconds. As CoVEGI instances behave like masters standalone in the 
first 50 seconds, we only show results not solved within these 50 seconds. We 
see that for tasks requiring a low amount of time, all instances (including the 
master alone) require a similar amount of CPU time. For tasks requiring more 
time, CoVEGI is actually often faster, the extreme being predicate abstraction 
as master which alone is unable to solve more difficult tasks in the given time. 
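The data points of such a quantile plot can be derived from the CPU times of the correct results as sketched below. The function name, the example times, and the choice to start ranks at 1 among the shown results are assumptions made for illustration.

```python
def quantile_points(cpu_times, cutoff=50.0):
    """Return (x, y) points: the x-th fastest shown result took y seconds.

    Results solved within `cutoff` seconds are omitted, since all
    configurations behave like the standalone master up to that point.
    """
    shown = sorted(t for t in cpu_times if t > cutoff)
    return list(enumerate(shown, start=1))


# Hypothetical CPU times (in seconds) of correctly solved tasks.
points = quantile_points([12.0, 300.5, 75.0, 49.9, 120.0])
```

A curve that reaches further to the right solves more tasks, and a lower curve solves them faster.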

(b) CoVEGI using predicate abstraction as master


Fig. 4: Quantile plots for CoVEGI using different single helpers.


As an example, we also compared the per-task CPU time of k-induction standalone
with that of CoVEGI using VERIABS as helper. It turns out that sharing only
slightly impacts the runtime, as shown in Figure 5. The scatter plot compares
the CPU time of k-induction standalone and of k-induction supported by
VERIABS for the tasks that both solved correctly. A data point (x, y) means
that k-induction standalone takes x seconds to solve the task and y seconds in
combination with VERIABS. The red dashed box contains all tasks solved within 50
seconds, where both tools behave equally, since the master does not request
help in these cases. We see some tasks for which helping increased the runtime,
but also some for which it decreased it. In most cases, the CPU time used
by CoVEGI is not significantly higher.

Finally, we compare the average CPU time needed to correctly solve a task. 
Table 4 shows the average time needed for all tasks and — in brackets — for the 
correctly solved tasks only. We observe that the runtime increases when only 
looking at correctly solved tasks (in particular for VERIABS), however, when 
considering all tasks the CPU time is even decreased. The latter effect is due to 
the number of timeouts of the master decreasing when cooperating with helpers. 
Concluding, we can make the following observation. 


On our dataset, collaborative invariant generation does not negatively impact
the efficiency; in some cases we even see small improvements.


RQ3 (Combination of helpers). In RQ3, we were interested in finding 
out (a) whether it is beneficial to run two invariant generators in parallel, and 
(b) if yes, which pair is best for this. We thus studied the number of correctly 
solved tasks using the three possible pairs of helpers, each running two helpers 
in parallel. Table 5 shows the results. 
Fig. 5: Scatter plot for kInd and kInd-VA


Table 5: Number of correctly solved tasks using different forms of cooperation
with two or three helpers running in parallel.


Master             +SH-UA  +SH-VA  +UA-VA  +SH-UA-VA
k-induction        153     156     163     154
predicate abstr.   130     130     136     129


For checking whether parallel execution of helpers is beneficial, these num- 
bers need to be compared against those for a single helper as given in Table 3. 
We see that predicate abstraction benefits from using two helpers, especially 
using ULTIMATEAUTOMIZER and VERIABS. Using CoVEGI with these tools 
perfectly combines their strengths, thereby increasing the number of correctly 
solved tasks in total by 17%. In contrast, it turns out that for k-induction none 
of the combinations of two helpers outperforms CoVEGI using VERIABS only. 
For ULTIMATEAUTOMIZER and VERIABS as helpers, the total number does not 
change, only the set of solved tasks. For instance, nearly 50% of the additional 
tasks solved by kInd-UA-VA are not solved using kInd-UA and vice versa. This 
result is based on the fact that they have to share the available CPU time in 
the combination. Hence, tasks that are solved using one of them as helper alone 
could not be solved anymore in a combination because of timeouts. This phe- 
nomenon is even more an issue when running all three helpers in parallel. 
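The 17% figure can be checked against Tables 3 and 5, assuming it is measured relative to predicate abstraction running standalone (116 tasks) and the UA-VA pair (136 tasks). The same computation for k-induction (146 standalone, 163 with UA-VA) gives the value that a single helper already achieves:

```latex
\[
  \frac{136 - 116}{116} = \frac{20}{116} \approx 0.172 \quad (\text{about } 17\%),
  \qquad
  \frac{163 - 146}{146} = \frac{17}{146} \approx 0.116 \quad (\text{about } 12\%).
\]
```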

The combination of all three helpers solves only 154 tasks correctly for
k-induction and 129 for predicate abstraction. In addition, we evaluated different
values for the parameter timeoutH in [29]; it turns out that waiting for all
helpers to finish does not increase the number of correctly solved tasks.


On our dataset, CoVEGI can increase the total number of correctly solved
tasks using UA and VA in parallel; in general, waiting for the other tool to
also finish its computation does not pay off.
4.4 Threats to Validity


We have conducted our evaluation using a random sample of tasks as well as 
those in the category Loops. Although this guarantees some diversity, our find- 
ings may not completely carry over to arbitrary real-world programs. 

The experiments were conducted using the reliable framework BENCHEXEC
on identical machines with the same resource limitations, guaranteeing comparable
results. As SEAHORN is used within a docker container, its CPU usage
cannot, however, be measured by BENCHEXEC. We therefore measured it externally,
rounded it up and added it to the measured CPU time, obtaining a lower bound
for the number of correctly solved tasks. Thereby, all results stay valid, especially
those of the best performing instantiations of CoVEGI, as they do not use SEAHORN.

Our implementation of CoVEGI relies on the correctness of the master
verifiers and helpers used (which are given) as well as on that of the adapters
(which we built). An incorrectly translated invariant can, however, influence the
performance only negatively. Both master verifiers as well as ULTIMATEAUTOMIZER
and VERIABS participate in the annual SV-COMP, hence they
might be tuned to the tasks employed. This does not, however, influence the
validity of the results, since our interest is in the additional number of tasks solved
by cooperation, not the solved ones per se.


5 Related work 


In this paper, we presented a framework for cooperative verification via collec- 
tive invariant generation. The idea of collaboration for verification by combin- 
ing known techniques has been widely employed before. For instance, there are 
combinations of verification with testing approaches [21,22,26,18,19,24] and with 
approaches for invariant generation [40,27,39,15,17]. The latter combinations are 
conducted in a white box manner using strong coupling between the components, 
making the addition of a new approach a challenging task. Our framework con- 
ceptually decouples the invariant generation from the verification, making it 
more flexible. In addition, using a black box integration with defined exchange 
formats allows us to easily exchange or integrate new approaches. 

There are also existing concepts for collaboration between different tech- 
niques in a black-box manner. Conditional model checking is a technique for 
sequentially composing different model checkers, sharing information between 
the tools in form of conditions [7]. Beyer and Jakobs developed a concept for 
combining model checking with testing [8]. Although both approaches enable co- 
operation, none combines a verification tool and tools for invariant generation. 

We next briefly discuss three approaches which are conceptually closer to
our framework. Frama-C is a framework for code analysis, aiming at analyzing
industrial-size code [35]. The framework contains different plugins, each
implementing a verification or testing technique. The plugins can exchange
information in form of ACSL source code annotations. Within Frama-C, the analyzers
can collaborate by being composed either sequentially or in parallel. For this,
partial results produced by one analysis can be completed by a second one, or several
partial results computed in parallel are composed to a complete result. Frama-C
offers the general possibility to define cooperation between existing plugins.
To the best of our knowledge, Frama-C does not, however, provide a conceptual
collaboration of a verification approach and tools for invariant generation driven
by the verification approach's demand for support.

The approach of using continuously refined invariants for k-induction [5] uses
a lightweight dataflow analysis which can be considered to be a helper for
verification. Therein, the supporting invariant generator runs in parallel to the
k-induction analysis. Compared to our framework, the main difference is the form
of cooperation used. Beyer et al. use a white-box integration for the cooperation
between k-induction and the invariant generator, building hard-wired connections
between both analyses and sharing the information inside the tool. Thus,
integrating external tools is hard to achieve. Moreover, the approach is designed
to work for k-induction only. Note that an analogous approach is proposed by
Brain et al. [17].

Pauck and Wehrheim proposed CODIDROID, a framework for cooperative
taint flow analysis for Android apps [38]. Within their framework, different
analysis tools with specialized capabilities are combined as black boxes.
CODIDROID is, however, tailored to the needs of Android taint flow analysis, so
the exchanged information differs. Hence, CODIDROID is not able to orchestrate
or exchange information on safety analysis with shared invariant generation.

To summarize, there are a lot of existing approaches for cooperative verifica- 
tion, but most of them are white-box combinations, and the existing black-box 
combinations are not general enough to allow for collective invariant generation. 


6 Conclusion 


In this paper, we have presented a novel form of black-box cooperation for
software verification via externally generated invariants. Within the configurable
framework named CoVEGI, the so-called master verifier steering the verification
process is able to delegate the task of invariant generation to one or several
helper invariant generators.

We implemented CoVEGI within the CPACHECKER framework using
k-induction and predicate abstraction as master analyses, supported by the three
existing helpers SEAHORN, ULTIMATEAUTOMIZER and VERIABS. Our evaluation on
a set of SV-COMP verification tasks shows that CoVEGI increases the number
of correctly solved tasks without increasing the overall verification time. The
best combination of helpers, ULTIMATEAUTOMIZER and VERIABS in parallel,
yields an increase of 12% for k-induction and 17% for predicate abstraction.

Next, we plan to enhance the cooperation by analyzing the behavior of the
master in order to identify an optimal point in time to request help. Moreover,
we plan to extend CoVEGI to additionally take error traces found by the helpers
into account. In addition, we intend to investigate whether selecting
helpers on the basis of the given verification task is beneficial.
Abstract. Security attacks present unique challenges to self-adaptive
system design due to the adversarial nature of the environment. Game 
theory approaches have been explored in security to model malicious 
behaviors and design reliable defense for the system in a mathematically 
grounded manner. However, modeling the system as a single player, as 
done in prior works, is insufficient for the system under partial compromise 
and for the design of fine-grained defensive strategies where the rest of the 
system with autonomy can cooperate to mitigate the impact of attacks. 
To deal with such issues, we propose a new self-adaptive framework incor- 
porating Bayesian game theory and model the defender (i.e., the system) 
at the granularity of components. Under security attacks, the architecture 
model of the system is translated into a Bayesian multi-player game, 
where each component is explicitly modeled as an independent player 
while security attacks are encoded as variant types for the components. 
The optimal defensive strategy for the system is dynamically computed 
by solving the pure equilibrium (i.e., adaptation response) to achieve 
the best possible system utility, improving the resiliency of the system 
against security attacks. We illustrate our approach using an example 
involving load balancing and a case study on inter-domain routing. 


1 Introduction 


A self-adaptive system is designed to be capable of modifying its structure and 
behavior at run time in response to changes in its environment and the system 
itself (e.g., variability in system performance, deployment cost, internal faults, 
and system availability) [9,12]. One of the major challenges in self-adaptive 
systems is managing uncertainty; i.e., the system should be capable of making 
appropriate planning decisions despite limited observations about its environment. 
Achieving security in presence of uncertainty is particularly challenging due to 
the adversarial nature of the environment [17,13]: (1) to avoid detection, a typical 
attacker may attempt to remain hidden while carrying out its actions, and so 
accurately estimating its objectives and capabilities can be difficult, and (2) the 
attacker actively attempts to cause as much harm as possible to the system, and 
so a typical “average case” analysis may not be appropriate for making optimal 
defensive decisions [28]. 


© The Author(s) 2021 
Various game-theoretic approaches have been explored in the security
community for modeling interactions between the system and attackers as a game
between a group of players (i.e., system and multiple attackers, each as one 
player) and computing optimal strategies (i.e., Nash Equilibrium) for the system 
to minimize the impact of possible attacks and improve its resiliency against 
them [40,15,19,28]. These methods can be used to (1) model adversarial behaviors 
by malicious attackers [19], and (2) design reliable defense for the system by using 
underlying incentive mechanisms to balance perceived risks in a mathematically 
grounded manner [15]. In particular, a type of game-theoretic method called 
Bayesian games [25] is designed to explicitly encode and reason about uncertainty 
in the information that players have (e.g., partial knowledge about each other’s 
actions and objectives). 

Prior works in security that leverage game theory [40,15,19,28] have treated
the system as an independent player (i.e., defender) in the game. However, such
a monolithic approach that involves abstracting the entire system as a single
player might be insufficient for capturing certain practical scenarios, where only
one part of the system is compromised while the remaining system components
may cooperate with each other to mitigate the impact of an ongoing attack.

In this paper, we argue that compared to a coarse one-player abstraction 
of a system, modeling the defender under security attacks at the granularity of 
components is more expressive, in that it allows the design of fine-grained defensive 
strategies for the system under partial compromise. In particular, we advocate 
a security modeling approach where an attack is modeled as the anomalous 
behavior of a system component that deviates from its expected behavior, as an 
alternative to a conventional approach where attackers themselves are modeled 
as separate players. 

To this end, we propose a novel approach to improving the resiliency of
self-adaptive systems against security attacks by leveraging game theory. In
particular, we propose a new self-adaptive framework that leverages multi-player
Bayesian games at the granularity of components at the system architecture
level. Specifically, in our approach, each major system component is modeled
separately as an independent player. Under an attack, one or more components 
with vulnerabilities might be exploited by an attacker to deliberately perform 
harmful actions (i.e., turning into a malicious type). Different types of attacks 
that these components might be subject to are encoded as different types of game 
players, encoding uncertainty in the attack being carried out. The rest of the 
components are then modeled as forming a coalition to mitigate the impact of 
the malicious actions by those compromised components. 


To perform a security analysis, a model of the system architecture and 
component attacks are translated into a mathematical Bayesian game structure. 
Then, the adaptive defensive strategy for the system is dynamically computed 
by solving a pure equilibrium, to achieve the best possible system utility under 
all assignments of the components to their possible types (i.e., in the presence of 
security attacks). 


Our main contributions are summarized as follows: 
- A self-adaptive framework that incorporates Bayesian game theory to improve
  the resiliency of the system under potential security attacks;

- An approach to modeling the system under attacks as a multi-player game
  with potentially uncooperative players at the granularity of components and
  the use of equilibrium as an optimal adaptation response;

- A demonstration of the applicability of our approach through an example
  with load-balancing scenarios and a case study involving a network routing
  application with a proposed dynamic programming algorithm.


2 Background 
2.1 Running Example 


As a running example, we adopt Znn.com, a hypothetical news website that has
been used as a representative system for the application of self-adaptive
systems [10,11]. In a typical workflow, given a request from a client, the
web server fetches appropriate content (in form of text) from its back-end
database and generates a web page containing a visualization of the text.
Furthermore, the system also provides an optional service with multimedia
content (e.g., images, videos). This service involves additional computation on the
server side, but also brings in more revenue compared to the requests with only
text. With $R_M$ and $R_T$ being the revenue, and $C_M$ and $C_T$ being the
computation, of one response to a user request with media content and with only
text content, respectively, we assume that $R_M > R_T > 0$ and $C_M > C_T > 0$.

In order to support multiple servers, a LoadBalancer is added to distribute
the requests from the users to a pool of servers, as shown in Figure 1. Each
server incurs a cost that grows with its load, e.g., due to potentially high response
times; companies such as Amazon, eBay, and Google claim that increased
user-perceived response time results in revenue loss [33]. To be more specific,
the cost per server is $(S_i - T)^2/K$, where $S_i$ is the current load of
server $i$, which depends on the request serving mode (i.e., $S_i = D_i C_T$ in
text-only mode and $S_i = D_i C_M$ in multimedia mode, where $D_i$ is the number
of requests distributed to server $i$); $T$ is the threshold beyond which the
response time would be affected; and $K$ is a constant used to adjust the cost ratio.

Fig. 1: Running Example.


The goal of the self-adaptive system is to maximize the difference between
revenue and cost:

$$U = R_M x_M + R_T x_T - \sum_{i=1}^{3} \big( S_i < T \;?\; 0 : (S_i - T)^2 / K \big) \qquad (1)$$

where $x_M$ and $x_T$ are the numbers of responses with media and text content,
respectively; the penalty is the sum of the cost for all three servers.
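Equation (1) translates directly into code. The sketch below is illustrative only: the function names and all numeric values are hypothetical and merely chosen to satisfy the stated assumption that media responses bring in more revenue than text responses.

```python
def server_cost(load, threshold, k):
    """Cost of one server: zero below the threshold T, quadratic above it."""
    return 0.0 if load < threshold else (load - threshold) ** 2 / k


def system_utility(x_media, x_text, r_media, r_text, loads, threshold, k):
    """Utility U from Eq. (1): revenue minus the summed server costs."""
    revenue = r_media * x_media + r_text * x_text
    penalty = sum(server_cost(s, threshold, k) for s in loads)
    return revenue - penalty


# Hypothetical numbers: 20 media and 10 text responses, three servers.
utility = system_utility(
    x_media=20, x_text=10,
    r_media=3.0, r_text=1.0,      # R_M > R_T > 0
    loads=[10.0, 20.0, 30.0],     # S_1, S_2, S_3
    threshold=15.0, k=5.0,        # T and K
)
```

Here the revenue is 70, the first server stays below the threshold, and the other two contribute costs of 5 and 45, so the utility is 20.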
Suppose that some of the servers are vulnerable to various attacks such as
password guessing, SQL injection, command injection, etc. [1]. The information
collected from a web server cannot establish with certainty whether the server is
compromised, e.g., due to the deficiencies of scanning tools; it only supports an
estimate under uncertainty. As shown in Figure 1, Server2 could be attacked with
a probability of 20%, while another server has a higher probability of 50%.
These two servers, if actually compromised, might perform harmful actions
controlled by the attackers to achieve the attackers' objectives, reducing the
system reward. Here we assume the malicious strategy of simply discarding all
the distributed user requests. The reward of an attack is defined as the system
loss, i.e., the difference between the maximum reward the system could achieve
and its reward under attack, leading to a zero-sum game.


2.2 Bayesian Game Theory 

Game theory is the application of mathematical analysis of individual and coop- 
erative behaviors between players that follow a certain strategy to satisfy their 
self-interests [21,38]. A Bayesian game is a type of game in which players have 
incomplete information about the other players [25]. For example, a player may 
not know the exact type (e.g., malicious or good) associated with a unique payoff 
function of the other players, but instead, have beliefs about these types. These 
beliefs are represented by a probability distribution over the possible types. More 
formally, Bayesian games or incomplete information games are defined as follows: 


Definition 1. A Bayesian game is a tuple $BG = (P, A, \Theta, U, p)$:

- A set of $n$ players $P$;

- A set of (joint) actions $A = A_1 \times \dots \times A_n$, where $A_i$ denotes a finite
set of actions available to player $i$;

- A set of types for each player $i$: $\theta_i \in \Theta_i$;

- A payoff function for each player $i$: $u_i(a_1, \dots, a_n; \theta_1, \dots, \theta_n)$,
determined by the types of all players and the actions they choose;

- A (joint) probability distribution $p(\theta_1, \dots, \theta_n)$ over types.


Importantly, throughout the Bayesian games, we assume that the assignment
of types to players is private information, while the prior probability
distribution over types, the action spaces and the payoff functions are assumed to be
common knowledge. A player's strategy can be pure (i.e., take a deterministic action) or
mixed (i.e., randomly choose an action according to some probability distribution).
A strategy for player $i$ is $s_i : \Theta_i \times A_i \to [0,1]$, with
$\forall \theta \in \Theta_i, \sum_{a \in A_i} s_i(a \mid \theta) = 1$.
The strategy is pure if it satisfies $\forall \theta \in \Theta_i, \exists a \in A_i,
s_i(a \mid \theta) = 1$, also denoted as $s_i : \Theta_i \to A_i$.


Definition 2. (Bayesian Nash Equilibrium Strategy) Given a joint strategy for
all players $s^* = [s_1^*, \dots, s_n^*]$, $s^*$ is the Bayesian Nash equilibrium strategy
if for any player $i$, it satisfies that:

$$s_i^* = \arg\max_{s_i \in S(\theta_i)} \sum_{\theta_{-i}} p(\theta_{-i} \mid \theta_i)\, \mathbb{E}_{a_{-i} \sim s_{-i}^*,\, a_i \sim s_i} \bigl[u_i(a_i, a_{-i}; \theta_i, \theta_{-i})\bigr]$$

where $a_{-i} = [a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n]$,
$\theta_{-i} = [\theta_1, \dots, \theta_{i-1}, \theta_{i+1}, \dots, \theta_n]$,
$s_{-i}^* = [s_1^*, \dots, s_{i-1}^*, s_{i+1}^*, \dots, s_n^*]$, $S(\theta_i)$ is the set of
all possible strategies for agent $i$ under $\theta_i$, and $p(\theta_{-i} \mid \theta_i)$ is
the conditional probability representing player $i$'s belief about the other players'
types under type $\theta_i$.

A Bayesian Nash equilibrium is a set of strategies, one for each type of each player.
Each strategy is a best response: it maximizes the player's expected payoff given the
other players' equilibrium strategies. In a Nash equilibrium, no player can improve
their payoff by unilaterally modifying their strategy if the strategies of the rest are
fixed [25,21].
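
The best-response condition behind Definition 2 can be illustrated with a small sketch.
The code below only checks one player's best response against a fixed, type-contingent
pure strategy of a single opponent, which is a necessary ingredient of an equilibrium
rather than a full equilibrium computation. The function names, the action labels and
the payoff numbers are hypothetical and chosen purely for illustration.

```python
def expected_payoff(action, opponent_strategy, belief, payoff):
    """Expected payoff of `action` under a belief about the opponent's type.

    belief: dict mapping opponent type -> probability
    opponent_strategy: dict mapping opponent type -> action taken by that type
    payoff: dict mapping (own_action, opponent_action, opponent_type) -> float
    """
    return sum(prob * payoff[(action, opponent_strategy[typ], typ)]
               for typ, prob in belief.items())


def best_response(actions, opponent_strategy, belief, payoff):
    """Action with the highest expected payoff against the given strategy."""
    return max(actions,
               key=lambda a: expected_payoff(a, opponent_strategy, belief, payoff))


# Hypothetical two-action game: send requests to a server, or skip it.
strategy = {"normal": "serve", "malicious": "discard"}
payoff = {
    ("send", "serve", "normal"): 1.0,
    ("send", "discard", "malicious"): -1.0,
    ("skip", "serve", "normal"): 0.0,
    ("skip", "discard", "malicious"): 0.0,
}
likely_normal = best_response(["send", "skip"], strategy,
                              {"normal": 0.8, "malicious": 0.2}, payoff)
likely_malicious = best_response(["send", "skip"], strategy,
                                 {"normal": 0.4, "malicious": 0.6}, payoff)
```

In this toy setting the best response flips from sending to skipping once the belief
that the server is malicious becomes large enough.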


3 Self-Adaptive Framework Incorporating Bayesian 
Game Theory 


Security attacks are usually associated with a high degree of uncertainty, where the
defender may neither know the identity of the attackers nor fully understand their
technical effect on the system. A Bayesian game is a game in which players have
incomplete information about the other players, which makes it appropriate for
modeling and dealing with attacks under uncertainty. In this section, we propose a new
type of self-adaptive framework incorporating Bayesian games. Adaptation behaviors
build on the Nash equilibrium under unexpected attacks and are achieved by elaborating
the widely adopted mechanism of the MAPE-K (Monitoring, Analysis, Planning, Execution,
Knowledge) loop [27,43], shown in Figure 2.

Knowledge. Knowledge Base requires the system developers or domain experts 
to specify (1) the component and connector model of the managed subsystem 
and its action space for each component, (2) system objectives usually defined as 
the quality attributes quantified by the utility, and (3) component vulnerabilities 
with potential behavior deviations that can be exploited by the potential attacks. 
Other necessary information such as the history information of system behaviors 
and environment information are saved in Knowledge Base and can be updated 
for the sake of self-adaptation. 

Monitor. The Monitor receives events generated in the managed subsystem or environment
that indicate the execution of system actions or natural changes in environmental
factors. It gathers and synthesizes information about on-going attacks through sensors
and saves it in the Knowledge Base. In our example, events such as the loss of a large
number of user requests or a command injection can indicate a potential attack on the
web server.

Analyzer. During speculative analysis, the Analyzer identifies conditions of the
environment or managed subsystem, based on the input from the Monitor, that represent
violations or better satisfaction of goals. It further checks whether certain
components are attacked and with what probability, identifies potential deviated
malicious actions, and estimates the rewards for the attack, based on the knowledge
about component vulnerabilities and system objectives. Such attack probabilities can
be analyzed with a statistical combination of all feasible scenarios along with expert
judgment [16,24]. A typical example is that both Server2 and Server3 are analyzed to be
compromised and discarding user requests with a certain probability, reducing the
system utility.

Fig. 2: Self-Adaptive Framework.

Planner. The Planner generates a workflow of adaptation actions aiming to counteract
violations of system goals or to better achieve goals. The workflow consists of one or a
set of actions, obtained by automatically solving the multi-player Bayesian game
constructed from the potential attacks reported by the Analyzer, the architectural
model of the managed subsystem and the system objectives, as elaborated in Section 4.
For each security situation, it generates an equilibrium, if one exists, as the
adaptation to respond to unexpected attacks, or prompts for a change in the design of
the system if the violation cannot be handled. For the Znn.com website under security
attacks, feasible actions include distributing a larger share of user requests to the
normal server while decreasing the share sent to servers with a high probability of
compromise, as well as adjusting the fidelity level of the servers.

Executor. During execution, the strategies from the adaptation equilibrium are
enacted on the managed subsystem through actuators. A typical example is setting, in
the LoadBalancer, the percentage of user requests distributed to each server.

In the next part, we focus on planning activity with Bayesian game theory. 
We assume adequate monitoring in place, sufficient analysis methods on potential 
attacks with uncertainties based on observation and historical information, as 
well as an execution environment through which selected adaptation strategies 
are enacted. 


4 Bayesian Game Through Model Transformation 


In this section, we start by defining the system under attacks and transforming the 
system architecture and on-going attacks into a component-based multi-player 
Bayesian game. Solving the game with equilibrium is to find the adaptation 
strategy. Then, we present the analysis results on our running example. 


Component-based System. A system component is an independent and re- 
placeable part of a system (e.g., a process, program) that fulfills a clear function 
in the context of a well-defined architecture. Typical examples are the LoadBal- 
ancer and servers in Figure 1. Components forming architectural structures affect 
different quality attributes. For example, quality attributes of user satisfaction 
(i.e., revenue) and the costs (i.e., penalty) identified in the Znn Website example 
are influenced by the actions of all four components and characterized as utility 
functions as shown in Eq.(1) mapping them to utility values. 


Definition 3. A system can be formally defined as a tuple $S = (C, A, Q)$.

- $C$ is a set of components;

- $A$ is a set of joint actions $A = A_1 \times \dots \times A_n$, where $A_i$ denotes a
finite set of actions available to component $i$;

- $Q$ is a set of quality attributes a system is interested in; for each $Q_i$, a subset
of components $SubC_i \subseteq C$ could contribute to this quality attribute.


Each component is trying to make the right reaction to maximize the system 
utility, essentially like a rational player in the game theory. Naturally, a system 
under normal operation could be viewed as a cooperative game dealing with 
how coalitions interact. Each component is denoted as an independent player 
and these interacting components/players form a coalition. For instance, in the 
running example, the LoadBalancer and three servers collaborate to achieve the 
goals together, i.e., maximizing the system reward with revenue and penalty. 
Specifically, the LoadBalancer should assign more user requests to those servers 
with low computation usage, like the waiting queue in the bank, while the server 
should adjust the fidelity level according to its current load. A high load may 
lead to the text only content to decrease the cost while the server with low usage 
can provide media content to promote the revenue. 


Modeling Utility as Payoffs. The payoff among those players is allocated 
by the utility from quality attributes. It is straightforward for developers to 
design a system-level payoff function (e.g., the revenue and penalty in Section 
2.1). However, due to the different roles of the components and the complex 
relationship between them, it is complicated and sometimes untraceable to 
manually design an appropriate component-level payoff function. To solve this 
problem, we use the Shapley Value Method, a solution concept of fairly distributing 
both gains and costs to several players working in coalition proportional to their 
marginal contributions [37,36], to automatically decompose the system-level 
utility into the component-level payoff. Shapley Value Method applies primarily 
in situations when the contributions of each player are unequal, but each player 
works in cooperation with each other to obtain the payoff. Given the component 
set C, and a system-level utility function v, the payoff for a component i is: 


$$\varphi_i(C, v) = \frac{1}{|C|!} \sum_{C' \subseteq C \setminus \{i\}} |C'|!\,(|C| - |C'| - 1)!\,\bigl[v(C' \cup \{i\}) - v(C')\bigr] \tag{2}$$

where $|C|$ is the number of components in the set; $C \setminus \{i\}$ is the set $C$
excluding component $i$; $v(C')$ is the expected system-level utility when the system
only consists of the component set $C'$.

The following is a typical example of system utility allocation with the
Shapley Value Method for the Znn website. To simplify the illustration, we
consider the situation where Server2 and Server3 are indeed compromised, the
LoadBalancer chooses the strategy of equally distributing user requests to Server1
and Server2 (i.e., the requests distributed to Server1, Server2 and Server3 are
50, 50 and 0 respectively), and Server1 selects the text only mode. Besides,
the total number of unprocessed requests in the setting is 100, which is assumed to be
the full load of a server serving only text, with $R_M = 1.6$, $R_T = 1$, $T = 50$,
and $K = 25$ in Eq.(1). The computation capacity of a unit of text and media
is 1 and 1.4 (i.e., $C_T$ and $C_M$) respectively. Thus, the system utility in this
situation is $U_{system} = 50$ (i.e., $50 \times 1 - (50 \times 1 - 50)^2/25$, with the
remaining 50 requests discarded by the malicious Server2). The cooperative player set
consisting of LoadBalancer and Server1 shares this utility, while Server2 and Server3
act on behalf of the attacks' interests and are thus neither considered in the
coalition nor allocated any payoff from the system utility.

Based on Eq.(2), we need the following two coalition cases for the Shapley Value
calculation: (1) If there is only the LoadBalancer without Server1 in the coalition,
the utility of the system $U_{LoadBalancer}$ is 0, since no requests are processed by
Server1 or by the malicious Server2; (2) If there is only Server1 without the
LoadBalancer distributing user requests, the requests are randomly passed among the
three servers, i.e., the requests distributed to Server1, Server2 and Server3 are 34,
33 and 33 respectively, and the utility of the system for this coalition $U_{Server1}$
is 34 (i.e., $34 \times 1 - 0$). This is because the malicious Server2 and Server3 do
not return any feedback. As a result,
$\varphi_{LoadBalancer}(C, v) = 1/2\,(U_{system} - U_{Server1} + U_{LoadBalancer}) = 8$ and
$\varphi_{Server1}(C, v) = 1/2\,(U_{system} - U_{LoadBalancer} + U_{Server1}) = 42$.
Therefore, the payoffs to players LoadBalancer and Server1 are 8 and 42 respectively.
Meanwhile, the attacks' utility, i.e., the difference between the system utility and
the highest utility the system could achieve without attacks (obtained by equally
distributing user requests to the three servers with each server choosing multi-media
mode, which in this setting has value $160 = 100 \times 1.6 - 0$), is equally divided
between the two malicious players. In other words, both Server2 and Server3 are
allocated payoff $55 = (160 - 50)/2$. Following the aforementioned allocation process,
each player obtains a unique payoff under different attack situations and strategies
from the Shapley Value Method, based on its marginal contribution to the system
utility.
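
The allocation in this worked example can be reproduced with a short sketch of Eq. (2).
The function name and the dictionary encoding of the characteristic function are
illustrative assumptions; the coalition utilities are the ones stated above, and the
value 0 for the empty coalition is an additional assumption that the text leaves
implicit.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Shapley value of each player, following Eq. (2).

    v maps a frozenset of players to the utility achieved by that coalition.
    """
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                weight = factorial(size) * factorial(n - size - 1)
                # Marginal contribution of player i to this coalition.
                total += weight * (v[coalition | {i}] - v[coalition])
        values[i] = total / factorial(n)
    return values


# Coalition utilities from the worked example; v(empty set) = 0 is assumed.
v = {
    frozenset(): 0.0,
    frozenset({"LoadBalancer"}): 0.0,
    frozenset({"Server1"}): 34.0,
    frozenset({"LoadBalancer", "Server1"}): 50.0,
}
payoffs = shapley_values(["LoadBalancer", "Server1"], v)

# Attack utility: loss against the attack-free optimum, split between Server2 and Server3.
attack_payoff = (160.0 - 50.0) / 2
```

The two Shapley values sum to the system utility of 50, which reflects the efficiency
property of the Shapley value.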


Component-based Attacks. A system under security attacks is also defined as a 
tuple SAS = (C, A, Q, ATT). Instead of modeling an attacker or several attackers 
with possible complex behaviors over different parts of the system, we model the 
on-going attacks ATT the system is enduring at the component level since the 
vulnerabilities of the components as well as their potential behavior deviations 
are comparatively easy to observe. ATT can be obtained by synthesizing the 
information from Monitor and Analyzer as described in Section 3. 


Definition 4. The security attacks on the system are formally defined as a tuple
$ATT = (C_{att}, A_{att}, P_{att}, R_{att})$.

- $C_{att}$ is the set of components affected by the attacks;

- $A_{att} = A_{att_1} \times \dots \times A_{att_m}$, where $A_{att_i}$ denotes the set of
actions controlled by attacks on compromised component $i$;

- $P_{att} = \{p_1, \dots, p_m\}$ is a set of probabilities, where $p_i$ is the probability
of component $i$ being successfully compromised;

- $R_{att}$ is the reward for attacks.


Translation into a Bayesian game. With the definition of the system on the
component level and the definition of the attacks ATT, a system under security
attacks is converted into a non-cooperative Bayesian game by the following steps:

1. Each component in the system $c \in C$, such as the LoadBalancer and the three
servers in the running example, is separately modeled as an independent player;

2. The components potentially affected by attacks, $C_{att} \subseteq C$, are associated
with two types (e.g., Server2 and Server3 can be normal or malicious in the
simplified Znn website scenario), while the remaining components $C \setminus C_{att}$,
i.e., LoadBalancer and Server1, are deterministically of normal type;

3. The probability distribution for a player $i$ over the two types (normal, malicious)
is $p(1 - p_i, p_i)$, with $p_i$ as defined in $P_{att}$. One typical example for
Server2 is $p(0.8, 0.2)$ and for Server3 $p(0.5, 0.5)$;

4. The action space of player $i$ under security attacks is the union of its normal
actions and the malicious actions controlled by attacks (i.e., $A_i \cup A_{att_i}$).
Server2 can serve user requests either with text only or multi-media content as a
normal player, or maliciously discard them as intended by the attacks;

5. The payoff for players of normal type is allocated from the system utility by
the Shapley Value Method, while components of malicious type performing harmful
actions are assigned the utility the on-going attacks obtain by achieving their own
goals. This assignment could be a simple equal split, or the Shapley Value Method if
the malicious players are treated as another coalition;

6. The constructed game is passed to a game solver to find a Nash equilibrium,
which, in essence, is the best reaction of the system to potential attacks.
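
Steps 2 and 3 can be sketched as the enumeration of joint type profiles that a chance
move would draw from. The function below is illustrative only: its name is hypothetical,
and it assumes that components are compromised independently of each other, which the
text does not state explicitly.

```python
from itertools import product


def type_profiles(p_att):
    """Enumerate joint type profiles and their probabilities.

    p_att maps each potentially compromised component to its probability of
    being malicious; independence between components is an assumption.
    """
    names = list(p_att)
    profiles = {}
    for combo in product(("normal", "malicious"), repeat=len(names)):
        prob = 1.0
        for name, typ in zip(names, combo):
            prob *= p_att[name] if typ == "malicious" else 1.0 - p_att[name]
        profiles[combo] = prob
    return profiles


# Compromise probabilities of the running example: Server2 20%, Server3 50%.
profiles = type_profiles({"Server2": 0.2, "Server3": 0.5})
```

Each key is a (Server2 type, Server3 type) pair; under the independence assumption the
four probabilities are 0.4, 0.4, 0.1 and 0.1.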


Note that this definition can be easily extended for the situation where a compo- 
nent is simultaneously compromised by different attackers with multiple types. 
Besides, the game solver we adopted in this work is Gambit [35], a collection of 
tools for building game models, computing game equilibrium and analyzing game 
results, to efficiently model the Bayesian game translated by the above steps and 
automatically figure out the equilibrium strategy as the adaptation response. 


4.1 Analysis Results for Znn.com Example 


In this subsection, we demonstrate how our approach can produce adaptation 
decisions under security attacks for Znn website to enhance the system utility. In 
particular, we exploit the Bayesian game model by following the aforementioned 
steps and generate the equilibrium. To explore different attack scenarios, we 
statically analyze a discretized region of the state space, which is projected 
over two dimensions that vary the malicious probability (i.e., probability_S2 and 
probability_S3) of Server2 and Server3 respectively (with values in the range 
(0, 1]). Each state of the discrete set requires a solution of the game with the 
Nash Equilibrium that quantifies the best utility the system could obtain. The 
experiment takes less than one minute to generate all the results, as shown in 
Figure 3, and for each state, the solution generation time is negligible. To set
up the experiment, we assume there are 100 user requests (the maximum load
of a server in text only mode) with $R_M = 1.6$, $R_T = 1$, $C_M = 1.4$, $C_T = 1$,
$T = 50$, and $K = 25$ in Eq.(1). Additionally, we adopt the probabilistic model
checking method as the benchmark [11,7,32] and compare our Bayesian game
theory method with it in terms of the system utility.

Figure 3 (a) illustrates the percentage of user requests distributed to Server1
from the strategy for the LoadBalancer in equilibrium. As expected, the percentage
for Server1 increases progressively with the increasing malicious probability
of Server2 and Server3, as more user requests are supposed to be processed
by a server under normal operation. In particular, we observed that the
percentage is around one third when both Server2 and Server3 are functioning
normally (i.e., both probability_S2 and probability_S3 are 0), with LoadBalancer
equally delivering the user requests since none of the servers is compromised.
Moreover, the percentage for Server1 reaches around 84% when the other two
servers are fully compromised. In this situation, LoadBalancer does not deliver
all user requests to Server1; otherwise Server1 may be overloaded, with the
increasing costs due to high response time outweighing the benefits of
request processing.

Fig. 3: Results for Znn Website: (a) percentage of user requests to Server1; (b)
percentage of user requests to Server2; (c) strategies for Server1; (d) system
utility with game theory approach; (e) delta utility between Bayesian game theory
approach and probabilistic model checking approach.


Figure 3 (b) describes the percentage of user requests that LoadBalancer
delivers to Server2 in the equilibrium. We can also observe that the user requests
to Server2 decrease as its malicious probability increases. In particular, Server2
receives 50 user requests when probability_S2 is 0 while Server3 is fully
malicious (i.e., probability_S3 = 1), where LoadBalancer should equally distribute
the user requests to both Server1 and Server2. Figure 3 (c) presents the strategy
in equilibrium for Server1. The states in which text content is provided are
indicated by red triangles, whereas the multimedia strategies for Server1 are
denoted by white rectangles. As we can see, the red points are in the upper right
corner, where the malicious probabilities of Server2 and Server3 are greater than 50%,
which means that they are very likely compromised. Therefore, LoadBalancer
distributes as many user requests as possible to Server1, and Server1 chooses
to provide text only content to avoid overloading. Otherwise, Server1 can
provide multimedia content under lighter load to promote user satisfaction
with higher revenue.

Figure 3 (d) illustrates the maximum utility the system can achieve under
various attack situations. In particular, we observe that the utility reaches around
160 when all three servers are cooperative and decreases progressively with
the increasing malicious probability of Server2 and Server3. This is consistent
with the fact that the system utility deteriorates under security attack. To
compare the system utility in game theory with existing methods, we adopt
probabilistic model checking [29] as the comparison standard to formally model
the running example and synthesize the adaptation strategy maximizing the
expected utility by reasoning about reward-based properties [11,7,32].
Figure 3 (e) presents the delta between the two approaches (i.e., system utility with
the game theory approach minus the utility with the probabilistic model checking
approach). Without security attacks, the adaptation decisions generated by the
two approaches achieve the same utility. However, with the increasing malicious
probability of Server2 and Server3, the game theory approach outperforms the
benchmark, providing a better response to make up for the utility loss due to
security attack. The average delta is 10.54, i.e., an improvement of about 15 percent,
with the game theory approach achieving an average utility of 80.39.


5 Evaluation — Routing Games 
To evaluate our approach and assess its applicability, we consider a
case study on an interdomain routing application. We first define the game (Section
5.1) and propose a dynamic programming algorithm to solve for the equilibrium
by decomposing the problem into smaller and tractable sub-games (Section 5.2).
The results are presented (Section 5.3) with a sensitivity analysis, illustrating how
the system can choose a robust strategy effective for a range of threat landscapes,
and a utility analysis quantifying the defender's utility with the Bayesian game
compared to a greedy solution within the security context.

A routing system is usually composed of smaller networks called nodes, as shown in
Figure 4. Since not all nodes are directly connected, packets often have to traverse
several nodes, and the task of ensuring connectivity between nodes is called
interdomain routing [30,31]. Each node could be owned by an economic entity
(Microsoft, AT&T, etc.) and might be compromised by the attacker at any time.
Therefore, it is natural to consider interdomain routing from a game-theoretic point
of view. Specifically, game players are source nodes located on a network, aiming to
send a package (i.e., starting at N1) to a unique destination node (i.e., N5). The
interaction between players is dynamic and complex (asynchronous, sequential, and
based on partial information), and the best strategy for each player as the adaptation
response is updated as needed.


Fig. 4: Routing Scenario.


5.1 Game Definition for Interdomain Routing


The interdomain routing system is described below with the component-based
definition.

- The component set for the interdomain routing is $C = \{N1, N2, \dots, N7\}$;

- The action space for each node is to deliver the package at hand to one of its
neighboring nodes. A typical example is $A_{N1} = \{toN2, toN3\}$;

- The only quality attribute this network needs to be concerned with is the
time taken to deliver the package to its destination, as we assume there is no
package loss. Specifically, we consider the delivery time to be proportional to the
distance, measured in hops between nodes. The utility function is encoded using
a formula that quantifies the utility of a given state and is
defined as $U_{system} = 10 - \#hops$. The longer the delivery time, the lower the
utility; the maximum utility the system could achieve under normal operation for
this network is 8, with two hops (N1, N2, N5).

Currently, N2 and N4 are analyzed to be potentially attacked based on
the historical package delivery record, deliberately sending the package in the
opposite direction and thereby extending the delivery time. The game definition with
the security attacks is summarized below.

- The player set for the game is $C = \{N1, N2, \dots, N7\}$. The set of components
affected by the attack includes N2 and N4, i.e., $C_{att} = \{N2, N4\}$;

- The action set for all players, including malicious ones controlled by attacks,
is delivering the package to one of their neighboring nodes;

- The set of types for each potentially attacked node includes "normal" and
"malicious" (i.e., $\theta_{N2} \in \{normal, malicious\}$,
$\theta_{N4} \in \{normal, malicious\}$);

- The payoff for all the normal players is allocated from the system utility with the
Shapley Value Method (i.e., $U_{system} / |normal\ players|$, equally allocated
in this case since none of the nodes in this network is a cut vertex, so all nodes
have the same importance). For example, each node is awarded 8/7 if none of them
is attacked. The utility for the on-going attacks on the two components is the
utility loss relative to the system's best response without attack, rendering a
zero-sum game;

- The probability distribution for both components N2 and N4 could be, e.g., a
50%/50% split (i.e., $p_{N2}(normal, malicious) = p_{N4}(normal, malicious) = (0.5, 0.5)$).
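
The utility definition above can be sketched with a breadth-first search over the
network. Since Figure 4 is not reproduced here, the edge list below is a hypothetical
reconstruction inferred from the routes mentioned in the text (N1, N2, N5; N3 to N4 or
N6; N6, N7, N5), and all function names are illustrative.

```python
from collections import deque

# Assumed topology, inferred from the routes described in the text.
EDGES = [("N1", "N2"), ("N1", "N3"), ("N2", "N5"), ("N3", "N4"),
         ("N3", "N6"), ("N4", "N5"), ("N6", "N7"), ("N7", "N5")]


def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hops(adj, src, dst):
    """Number of hops on a shortest path from src to dst (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


def system_utility(n_hops):
    """U_system = 10 - #hops, as defined for the routing example."""
    return 10 - n_hops


adj = adjacency(EDGES)
best = system_utility(hops(adj, "N1", "N5"))   # shortest route N1, N2, N5
per_node = best / len(adj)                      # equal share when no node is attacked
```

Under the assumed topology this reproduces the maximum utility of 8 and the per-node
payoff of 8/7 quoted above.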




5.2 Dynamic Programming Algorithm 

In practice, a network might be complex and each node could have hundreds 
of neighboring nodes. It is impractical to directly build a game tree, in the 
component level with a large number of players (each with a massive action set), 
and solve such a network in a reasonable time. To deal with the complexity of 
network nature, we propose an algorithm inspired by dynamic programming to 
effectively solve the generated Bayesian game for this class of routing problems. 


Algorithm 1 for the routing game takes as input a routing network $N$, together with
a starting point $s$ of package delivery and a destination point $d$. To carry out
dynamic programming, the algorithm uses a set subG to store the nodes
which have been processed with their best reactive strategy. subG is initialized
as an empty set (line 1) and then extended with node $d$ (line 2), since $d$ does not
need a strategy to transmit the package. The algorithm starts by iterating over all the
nodes at distance disValue from $d$ (line 5), where disValue is initialized to 1
(line 3). For example, N2, N4 and N7 qualify in the first iteration. Each node is
checked for whether it is potentially attacked (i.e., uncertain(n) in line 6).
Uncertain nodes (e.g., N2 and N4) might affect the strategy of their prior nodes
(line 7) (e.g., N1 and N3), which are added to todoS (line 8) so that their strategy
is updated later to account for the neighboring uncertainty. A typical example is that
node N3 might trade off the delivery between N4 and N6: even though N4 is on the
shortest path from N3 to N5, it could, if controlled by the attack, deliberately send
the package back. If the node is not in todoS to be updated (line 11),
it is directly added to subG (line 12), as the best strategy for such a benign
node is passing the package down to its adjacent node along the shortest path.
In this routing scenario, N2, N4 and N7 are added to subG, as their strategies in
equilibrium with normal type are easily determined.
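
The first pass over a distance layer (lines 5 to 14 of Algorithm 1) can be sketched as
follows. The edge list is a hypothetical reconstruction of Figure 4 inferred from the
routes mentioned in the text, and the function names are illustrative; the sketch covers
only the layering and the marking of predecessor nodes, not the sub-game solving.

```python
from collections import deque

# Assumed topology and uncertain nodes for the running routing example.
EDGES = [("N1", "N2"), ("N1", "N3"), ("N2", "N5"), ("N3", "N4"),
         ("N3", "N6"), ("N4", "N5"), ("N6", "N7"), ("N7", "N5")]
UNCERTAIN = {"N2", "N4"}


def build_adj(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def distances_to(adj, dst):
    """Hop distance of every node to the destination (breadth-first search)."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def process_layer(adj, dist, dis_value, uncertain, sub_g, todo_s):
    """Lines 5-14 of Algorithm 1 for the nodes at distance dis_value."""
    for n in sorted(dist):
        if dist[n] != dis_value:
            continue
        if n in uncertain:
            # Predecessors one hop further away must revisit their strategy.
            for n_p in adj[n]:
                if dist[n_p] == dis_value + 1:
                    todo_s.add(n_p)
        if n not in todo_s:
            sub_g.add(n)


adj = build_adj(EDGES)
dist = distances_to(adj, "N5")
sub_g, todo_s = {"N5"}, set()
process_layer(adj, dist, 1, UNCERTAIN, sub_g, todo_s)
```

For the assumed topology, the first layer consists of N2, N4 and N7, and the nodes
marked for a later strategy update are N1 and N3, matching the description above.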


After iterating over all the nodes at disValue 1, each node in todoS (line 15) is
checked for whether it satisfies the condition (line 16) that all its neighboring
nodes (i.e., $i \in adj(n)$) closer to the destination (i.e.,
$dis(i, d) == dis(n, d) - 1$) have been solved with their best strategies (i.e., are in
subG), so that a sub-game can be built. As shown in the example, though both N1 and N3
are prior to an uncertain node, their strategy update is postponed as N6 is not in
subG yet, which affects the sub-game generation for N3, in turn delaying the sub-game
construction for N1.


An exemplified sub-game construction (line 17) starting from N3 is illustrated
in Fig. 5 when all conditions are satisfied. The stochastic behavior of those
potentially compromised nodes can be modeled by introducing a nature (or chance)
player, who moves according to the probability distribution (e.g., a 50%/50% split),
randomly determining whether attacks on N2 and N4 are successful. Then, N3 can choose
an action passing the package to one of its adjacent nodes, i.e., N6 or N4. Here,
N3 is a normal node aware that the package was transmitted from N1, so it is
not necessary to consider a rollback to N1. The game ends after N3's action, as we
can prune the following branches: 1) to N6, the remaining route sequence is N7
and N5 by default, as their best strategies have been solved (i.e., N6 delivers the
package to N7, which in turn forwards it to N5); 2) to N4, with N4 forwarding
to N5 if it is normal and sending the package back to N3 if it is of malicious type.
When the game terminates, each player gets a unique payoff following different
branches. As shown in the leftmost rectangle, all the players (including N2 and N4, as
they are benign collaborating nodes) equally share the system utility value 6,
resulting from 3 hops from N3 to N5 plus the shortest path from N1 to N3. However, on
the rightmost branch, only five players, excluding N2 and N4, are allocated the
system utility 4. This system utility results from 6 hops if N3 decides to
deliver the package to N4 while nature probabilistically chooses the malicious
type for N4, which sends the package back to N3 to maximize the attack's utility.
Once N3 receives the package from N4, it redelivers the package to N6, because
N3 as a good player does not repeatedly send it back. In this case, N2 and N4 are
uniformly allocated the delta (i.e., 4) between the utility the system obtained
(i.e., 4) and the maximum utility the system could obtain (i.e., 8) as the payoff. The
payoffs of the remaining branches can be calculated accordingly.

Fig. 5: Sub-Game for N3. (Game tree: nature draws the types of N2 and N4, each with
probability 0.5; N3 then chooses To N6 or To N4.)

Algorithm 1 Dynamic Programming Algorithm to Solve Routing Game.
 1: subG = ∅
 2: addNode(d, subG)
 3: disValue = 1
 4: repeat
 5:   for all n ∈ N and dis(n, d) == disValue do
 6:     if uncertain(n) == true then
 7:       for all n_p ∈ adj(n) and dis(n_p, d) == disValue + 1 do
 8:         addNode(n_p, todoS)
 9:       end for
10:     end if
11:     if n ∉ todoS then
12:       addNode(n, subG)
13:     end if
14:   end for
15:   for all n ∈ todoS do
16:     if ∀i ∈ adj(n) and dis(i, d) == dis(n, d) - 1 and i ∈ subG then
17:       gambitTree = buildGame(n, d)
18:       equilibria ← solve(gambitTree)
19:       removeNode(n, todoS)
20:       addNode(n, subG)
21:     end if
22:   end for
23:   disValue ← disValue + 1
24: until s ∈ subG

After that, a pure Nash equilibrium is generated by solving this sub-game (line
18) with the Gambit software tools [35], and the best strategy for the node is updated
according to the equilibrium. By solving the sub-game for N3, the strategy for
N3 in the equilibrium is to deliver the package to N6, as the potential detriment
of a delayed delivery via N4 due to attacks is greater than the comparative
advantage of the shortest path. This node with the solved strategy is then
removed from todoS (line 19) and absorbed into subG (line 20). Once all the nodes
at distance disValue from the destination have been iterated over and the best
strategies of all the nodes in todoS satisfying the condition have been computed, the
algorithm increments disValue by one (line 23) and continues until
the starting point s is in subG (line 24).
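
The trade-off faced by N3 can be approximated with a back-of-the-envelope sketch. Note
the simplification: the code compares the expected system utility of N3's two actions
instead of solving the sub-game on per-player payoffs with Gambit, so it is not the
procedure used in the paper. The hop counts for the N6 route (4 hops) and for the
malicious N4 case (6 hops) are taken from the text; the 3 hops for a normal N4
(N1, N3, N4, N5) and the function name are assumptions.

```python
def n3_expected_utilities(p_n4_malicious):
    """Expected system utility of N3's two actions, package arriving via N1.

    Simplified stand-in for the sub-game: it averages U_system = 10 - #hops
    over N4's type instead of computing an equilibrium.
    """
    to_n6 = 10 - 4        # N1, N3, N6, N7, N5: 4 hops
    normal = 10 - 3       # N1, N3, N4, N5: 3 hops (assumed route)
    malicious = 10 - 6    # N4 returns the package, N3 then uses N6: 6 hops
    to_n4 = (1 - p_n4_malicious) * normal + p_n4_malicious * malicious
    return {"toN6": float(to_n6), "toN4": to_n4}


def n3_choice(p_n4_malicious):
    """Action with the higher expected system utility."""
    utilities = n3_expected_utilities(p_n4_malicious)
    return max(utilities, key=utilities.get)


choice_even_split = n3_choice(0.5)   # the 50%/50% case discussed above
choice_low_risk = n3_choice(0.3)
```

Under this simplification, N3 prefers N6 for the 50%/50% split and switches to N4 only
when the malicious probability of N4 is small.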


5.3 Experiment Setup & Results 


We demonstrate how our Bayesian game approach, combined with the proposed 
dynamic programming algorithm, can produce adaptation decisions about how to 
forward packages for each node in the routing example. Similar to the experiment 
on the Znn website, we statically analyzed a discretized region 
of the state space that represents different attack scenarios (i.e., the malicious 
probabilities of N2 and N4). The network structure used in the experiment 
is exactly as shown in Figure 4. In addition, we adopted a greedy algorithm 
for this routing application as the benchmark, and compared the system utility 
of the two approaches to demonstrate the advantage of game theory 
under security attacks. The experiment over the whole state space with the 
Bayesian approach takes less than one minute, and the solution generation time 
for each state is negligible. 

Fig. 6: Results for interdomain route example: (a) Expected route in equilibrium; 
(b) System utility with game theory approach; (c) Delta between system utility 
from game theory approach and utility from greedy algorithm. 

Figure 6 (a) presents the results of the strategy selection (i.e., expected package 
sequence) over two dimensions that correspond to the malicious probability of 
N2 and N4, respectively. Red triangle points denote that the strategy for N1 is 
to forward to N2; they span the range of Probability_N2 of around [0, 0.50]. 
This is because when the chance of N2 coming under attack is less than 0.50, N1 
should pass the package to N2, since N2 is on the shortest path to the 
destination; otherwise, N1 delivers the package to N3. Similarly, when the 
malicious probability of N4 is less than 0.35, the equilibrium strategy for N3 
is to deliver the package to N4 (i.e., blue square points), since the benefits 
of a short delivery time outweigh the potential detriment. For the remaining 
situations, denoted by the black circle points, N1 passes the package to N3, 
which in turn forwards it to N6. 
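
The two thresholds can be approximated by a back-of-the-envelope calculation. The derivation below is a simplification, not the authors' Gambit model: it assumes each node minimises the expected number of hops, that a malicious node bounces the package back (two wasted hops), and it uses hop counts inferred from the route sequences in the text. Here $p$ and $q$ denote the malicious probabilities of N2 and N4, and $h_X$ is the expected hop count from the deciding node to N5 when it forwards to $X$.

```latex
\begin{align*}
\text{At N3:}\quad
  h_{N4}(q) &= 2(1-q) + 5q = 2 + 3q, \qquad h_{N6} = 3,\\
  \text{N3 forwards to N4} &\iff 2 + 3q < 3 \iff q < \tfrac{1}{3}.\\[4pt]
\text{At N1, with } r &= \min(2 + 3q,\; 3):\\
  h_{N2}(p) &= 2(1-p) + p\,(3 + r) = 2 + p\,(1 + r), \qquad h_{N3} = 1 + r,\\
  \text{N1 forwards to N2} &\iff p < \frac{r-1}{r+1}
  \quad\bigl(= \tfrac{1}{2} \text{ when } r = 3\bigr).
\end{align*}
```

Under these assumptions the switch points are $q = 1/3$ and, when N4 is likely compromised, $p = 1/2$, which is roughly in line with the values of about 0.35 and 0.50 reported for the discretized state space.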

Figure 6 (b) shows the utility the system obtains under the equilibrium 
strategies of the attacked components. As expected, when Probability_N2 is 
greater than 50% and Probability_N4 is greater than 35% (i.e., the black circle 
points in Figure 6 (a)), the utility the system can gain is 6, as there are 4 
hops in the expected sequence (N1 N3 N6 N7 N5). This plot also shows that the 
system utility increases progressively as the probability of N2 and N4 being 
compromised decreases. When Probability_N2 is 0, the expected utility increases 
to 8 (i.e., two hops in (N1 N2 N5)). Similarly, the utility reaches 7 when 
Probability_N4 is 0 (i.e., three hops in (N1 N3 N4 N5)). 

Furthermore, we adopted a baseline that generates strategies for each node 
in a non-repeating fashion, passing the package to the adjacent node along the 
shortest path to the destination. The aim was to compare the utility 
of the two approaches when dealing with security attacks. For the network 
shown in Figure 4, the baseline first picks the shortest path sequence 
(N1 N2 N5). If N2 is compromised and sends the package back, N1 redelivers 
it to N3 instead of N2, since the package was received from N2. The system 
utility for the greedy algorithm is the expected value, i.e., the weighted 
average of the utilities of the paths in the different attack situations. 
Figure 6 (c) shows the delta between the utility produced by our game theory 
method and the utility produced by the baseline. We can see that, under 
security attacks, the utility of the game theory approach is always higher than 
that of the greedy approach. The delta is most noticeable in the situations 
where N2 and N4 are highly likely to be compromised (i.e., Probability_N2 and 
Probability_N4 close to 1). This is because game theory approaches can help the 
defenders to trade off the gains and losses due to perceived risks. 
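
The comparison with the baseline can be illustrated with a small calculation. The sketch below rests on several assumptions that are not taken from the paper: system utility is modelled as `10 - hops` (fitted to the values quoted in the text), a compromised node bounces the package back to its sender, and the "game" side is approximated by picking the pure route plan with the best expected utility instead of solving the Bayesian game with Gambit. All function names are hypothetical.

```python
def hops_from_n3(q, via_n4):
    """Expected hops from N3 to N5; q is the malicious probability of N4."""
    if via_n4:
        # benign: N3-N4-N5 (2 hops); malicious: bounce (2) + N3-N6-N7-N5 (3)
        return (1 - q) * 2 + q * 5
    return 3.0  # N3-N6-N7-N5

def hops_from_n1(p, q, via_n2, n3_via_n4):
    """Expected hops from N1 to N5; p is the malicious probability of N2."""
    rest = hops_from_n3(q, n3_via_n4)
    if via_n2:
        # benign: N1-N2-N5 (2 hops); malicious: bounce (2) + N1-N3 (1) + rest
        return (1 - p) * 2 + p * (3 + rest)
    return 1 + rest

def utility(hops):
    return 10 - hops  # assumed relation between hops and system utility

def greedy_utility(p, q):
    """Baseline: always try the shortest path first (N2, then N4)."""
    return utility(hops_from_n1(p, q, True, True))

def best_plan_utility(p, q):
    """Stand-in for the equilibrium: best pure plan by expected utility."""
    return max(
        utility(hops_from_n1(p, q, a, b))
        for a in (True, False)
        for b in (True, False)
    )

def delta(p, q):
    return best_plan_utility(p, q) - greedy_utility(p, q)
```

In this simplified model the delta is never negative, and it grows as both probabilities approach 1, where the baseline wastes hops on both compromised nodes. This matches the qualitative trend described for Figure 6 (c), although the exact values in the paper come from the Gambit-solved game.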

In summary, the preliminary results of our experiment indicate that our game 
theory approach at the component level applies to self-adaptive applications. To 
adopt our approach, attack information, such as the various types with their 
probabilities and payoffs, must be provided by the Analyzer in order to 
construct a Bayesian game based on the system's architectural structure. The 
results also show that game theory can enhance the performance of the system, 
especially when a potential attack is more likely to happen. In these 
situations, game theory approaches can help the defenders balance perceived 
risks by using underlying incentive mechanisms, and determine the best response 
as the adaptation to be executed on the network using proven mathematics. In 
addition, our proposed dynamic programming algorithm, which optimizes the game 
solving, is specific to this kind of application. Another potential application 
is the multi-agent path finding (MAPF) problem, where a spatial position in a 
path can be viewed as a node in the network [39,3]. Other optimization 
techniques might be adopted or customized for different applications with 
complicated game structures. 


6 Related Work 


Self-adaptive systems under security attacks need to make adaptation decisions 
as a response to a detected threat or to deviations from security goals and 
requirements [18]. Lorenzoli et al. [34] proposed a technique that observes 
values at relevant program points and identifies the execution contexts leading 
to a software failure, so that mechanisms can be enabled for preventing future 
occurrences of failures of the same type. Bailey et al. [4] generated Role 
Based Access Control (RBAC) models to provide assurances for adaptations 
against insider threats. The RBAC technique was also applied to cloud computing 
environments to provide appropriate security services according to the security 
level and dynamic changes of the common resources [44]. Tsigkanos et al. [41] 
explored the use of Bigraphical Reactive Systems to perform speculative threat 
analysis through model checking. Burmester et al. [5] described a threat model 
that incorporates typical characteristics of systems, such as survivability to 
abnormal behavior and the possibility to recover after critically vulnerable 
states are reached. Dimkov et al. [14] discussed insider threats that span the 
physical, cyber and social domains and presented a framework, Portunes, that 
integrates all three security domains to describe attacks. Nashif et al. [2] 
presented a multi-level intrusion detection system that detects network attacks 
at three levels of granularity and proactively protects against them by 
employing a fusion decision algorithm. Although there are many different ways 
of dealing with security attacks in self-adaptive systems, it is notable that 
the application of game theory, which can model the adversarial nature of 
security attacks and support the design of reliable defenses with proven 
mathematics, has not gained the attention it deserves. 


Different sorts of games have been employed to study the actions of the 
defender and attacker. Dijk et al. [42] presented a two-player game that reasons 
about security scenarios where an attacker with uncertainty about its actions may 
periodically gain full control of an asset, with each side trying to maintain 
control for as long as possible. An extension by Farhang et al. [19] explicitly 
modeled the information gained by the attackers as they control assets, which 
improves the attacker’s capability. Building on this work, Kinneer et al. [28] 
additionally considered multiple attacker types with different goals and 
capabilities by means of a Bayesian game. Instead of modeling the attackers as 
independent players, our work models the attacks at the component level, 
focusing on the defender modeling at the architecture level and possible 
deviations of component behaviors. Camara et al. [6,8] adopted a game-theoretic 
perspective and modeled the system as turn-based stochastic multi-player games 
in which players can either cooperate to achieve the same goal or compete to 
achieve their own goals. In addition, Glazier et al. [23] used a game-based 
approach to automatically reason about and synthesize strategies for a 
meta-manager by explicitly considering alternate potential future states, thus 
improving the performance of a collection of autonomic systems against a 
defined quality objective. Although some of these existing works are concerned 
with competitive behaviors in a system where some components cannot be 
controlled and may even behave according to goals that conflict with those of 
other components in the system, none of them, to the best of our knowledge, 
proposed to model the Bayesian game at the architecture/component level and to 
capture multiple attacks as a component’s variant types, as well as the 
uncertainty due to unsuccessful compromise. 


Game theory is also increasingly applied to network security. Frigault et 
al. [20] measured network security in a dynamic environment with a model based 
on dynamic Bayesian networks that incorporates temporal factors. Charles et al. [26] 
developed a packet forwarding game model under imperfect private monitoring. 
Their equilibria rely on the probability of cooperation after observing a defection, 
similar to our routing games in the evaluation. However, they looked at this 
problem from the perspective of network nodes, without considering the situation 
of being attacked and how to allocate rewards from the system utility for multiple 
components from the architecture perspective as illustrated in this work. 


7 Conclusion and Future Work 


In this paper, we have proposed a new framework for self-adaptive systems that 
adopts Bayesian game theory and models the system under security attacks 
as a multi-player game. An optimal adaptation strategy for responding to attacks 
is generated by computing the equilibrium of the game. One limitation is that 
we validated our approach on a simulated rather than an actual system; we 
plan to further evaluate the applicability and scalability of the approach using 
case studies involving real systems. A second limitation is the simplification of 
the amount of uncertainty, such as restricting the number of component types 
under attack and assuming zero-sum payoffs, both of which might be 
more complex in the real-world security landscape. Rather, we attempted to 
convey the idea of transforming a system architecture consisting of multiple 
components under attack into a Bayesian game. Since the equilibrium is sensitive 
to the probability distribution over types (i.e., the malicious probability), 
sensitivity analyses are useful when the probability cannot be determined by 
the analysis with precision but lies within a known range. In addition, 
modeling attacks at the component level, though easier to monitor and handle, 
cannot capture attacks by highly motivated and capable adversaries willing to 
devote significant time and continuous effort to achieving their malicious 
goals, known as advanced persistent threats (APTs) [28]. 

Moreover, we adopt a pure equilibrium as the adaptation response. However, in 
practice, there will likely be multiple equilibria and no guarantee of uniqueness. 
While this is an area for future work, one possible way to overcome this is to 
choose the equilibrium with the highest utility for the system. Another 
limitation, and a topic for future work, is that we do not consider mixed 
equilibria, another common solution concept in game theory. A mixed equilibrium 
can be interpreted in various ways in terms of system behavior, which allows 
the generation of different types of defense strategies for the system that can 
be explored for different applications. For example, if the mixed strategy for N1 
in the routing game of Figure 4 is to choose N2 and N3 in a 50%/50% split, 
we can consider that N1 may distribute its packages equally between N2 and N3 if 
multiple packages exist, or deliver its package to N3 this time and to 
N2 next time. Also, the Bayesian games for these two examples were manually 
created by following the framework and encoded in the input language of the 
Gambit tool to solve for the equilibrium. In the future, we plan to construct 
the game in an automated way by supporting an architecture description 
interchange language, 
such as Acme [22]. 


Abstract. When developing complex software and systems, contracts 
provide a means for controlling the complexity by dividing the respon- 
sibilities among the components of the system in a hierarchical fashion. 
In specific application areas, dedicated contract theories formalise the 
notion of contract and the operations on contracts in a manner that sup- 
ports best the development of systems in that area. At the other end, 
contract meta-theories attempt to provide a systematic view on the var- 
ious contract theories by axiomatising their desired properties. However, 
there exists a noticeable gap between the most well-known contract meta- 
theory of Benveniste et al. [5], which focuses on the design of embedded 
and cyber-physical systems, and the established way of using contracts 
when developing general software, following Meyer’s design-by-contract 
methodology [18]. At the core of this gap appears to be the notion of pro- 
cedure: while it is a central unit of composition in software development, 
the meta-theory does not suggest an obvious way of treating procedures 
as components. 

In this paper, we provide a first step towards a contract theory that 
takes procedures as the basic building block, and is at the same time 
an instantiation of the meta-theory. To this end, we propose an ab- 
stract contract theory for sequential programming languages with pro- 
cedures, based on denotational semantics. We show that, on the one 
hand, the specification of contracts of procedures in Hoare logic, and 
their procedure-modular verification, can be cast naturally in the frame- 
work of our abstract contract theory. On the other hand, we also show 
our contract theory to fulfil the axioms of the meta-theory. In this way, 
we give further evidence for the utility of the meta-theory, and prepare 
the ground for combining our instantiation with other, already existing 
instantiations. 


1 Introduction 


Contracts. Loosely speaking, a contract for a software or system component is 
a means of specifying that the component obliges itself to guarantee a certain 
behaviour or result, provided that the user (or client) of the component obliges 
itself to fulfil certain constraints on how it interacts with the component. 

One of the earliest inspirations for the notion of software contracts came 
from the works of Floyd [10] and Hoare [15]. One outcome of this was Hoare 
logic, which is a way of assigning meaning to sequential programs axiomati- 
cally, through so-called Hoare triples. A Hoare triple {P}S{Q} consists of two 
assertions P and Q over the program variables, called the pre-condition and 
post-condition, respectively, and a program S. The triple states that if the pre- 
condition P holds prior to executing S, then, if execution of S terminates, the 
post-condition Q will hold upon termination. With the help of additional, so- 
called logical variables, one can specify, with a Hoare triple, the desired relation- 
ship between the final values of certain variables (such as the return value of a 
procedure) and the initial values of certain other variables (such as the formal 
parameters of the procedure). 
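
To make the reading of a Hoare triple concrete, the sketch below tests a triple on a finite set of initial states. It is an illustration only: `holds_on` and `isqrt_proc` are hypothetical names, testing on sample states is not a proof, and the initial state passed to the post-condition plays the role of the logical variables mentioned above.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, int]

def holds_on(
    pre: Callable[[State], bool],
    program: Callable[[State], State],
    post: Callable[[State, State], bool],
    states: Iterable[State],
) -> bool:
    """Check {pre} program {post} on the given initial states."""
    for initial in states:
        if not pre(initial):
            continue  # the triple promises nothing outside the pre-condition
        final = program(dict(initial))  # run on a copy, keep the initial state
        if not post(initial, final):
            return False
    return True

def isqrt_proc(s: State) -> State:
    """Procedure S: store the integer square root of x in r."""
    r = 0
    while (r + 1) * (r + 1) <= s["x"]:
        r += 1
    s["r"] = r
    return s

# Pre-condition P: x >= 0.
pre = lambda s: s["x"] >= 0
# Post-condition Q relates the final r to the initial x: r^2 <= x0 < (r+1)^2.
post = lambda s0, s: s["r"] ** 2 <= s0["x"] < (s["r"] + 1) ** 2
```

For example, `holds_on(pre, isqrt_proc, post, [{"x": n} for n in range(-3, 50)])` checks the triple on 53 initial states and skips those with negative `x`, since they violate the pre-condition.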

This style of specifying contracts has been advocated by Meyer [18], together 
with the design methodology Design-by-Contract. A central characteristic of this 
methodology is that it is well-suited for independent implementation and verifi- 
cation, where software components are developed independently from each other, 
based solely on the contracts, and without any knowledge of the implementation 
details of the other components. 


Contract Theories. Since then, many other contract theories have emerged, such 
as Rely/Guarantee reasoning [16,22] and a number of Assume/Guarantee con- 
tract theories [4,6]. A contract theory typically formalises the notion of contract, 
and develops a number of operations on contracts that support typical design 
steps. This in turn has led to a few developments of contract meta-theories 
(e.g. [5,2,8]), which aim at unifying these, in many cases incompatible, contract 
theories. The most comprehensive, and well-known, of these, is presented in Ben- 
veniste et al. [5], and is concerned specifically with the design of cyber-physical 
systems. Here, all properties are derived from a most abstract notion of a con- 
tract. The meta-theory focuses on the notion of contract refinement, and the 
operations of contract conjunction and composition. The intention behind re- 
finement and composition is to support a top-down design flow, where contracts 
are decomposed iteratively into sub-contracts; the task is then to show that the 
composition of the sub-contracts refines the original contract. These operations 
are meant to enable independent development and reuse of components. In ad- 
dition, the operation of conjunction is intended to allow the superimposition of 
contracts over the same component, when they concern different aspects of its 
behaviour. This also enables component reuse, by allowing contracts to reveal 
only the behaviour relevant to the different use cases. 


Motivation and Contribution. The meta-theory of Benveniste et al. focuses on 
the design of embedded and cyber-physical systems. However, there exists a 
noticeable gap between this meta-theory and the way contracts are used when 
developing general software following Meyer’s design-by-contract methodology. 
At the core of this gap appears to be the notion of procedure¹. While the 
procedure is a central unit of composition in software development, the 
meta-theory does not suggest an obvious way of treating procedures as 
components. This situation is not fully satisfactory, since the software 
components of most embedded systems are implemented with the help of procedures 
(a typical C-module, for instance, would consist of a main function and a 
number of helper functions), and their development should ideally follow the 
same design flow as that of the embedded system as a whole. 

¹ We use the term “procedure”, rather than “function” or “method”, to refer to the 
well-known control abstraction mechanism of imperative programming languages. 

In this paper we provide a first step towards a contract theory that takes 
procedures as the basic building block, and at the same time respects the ax- 
ioms of the meta-theory. Our contract theory is abstract, so that it can be 
instantiated to any procedural language, and similarly to the meta-theory, is 
presented at the semantics level only. Then, in the context of a simplistic imper- 
ative programming language with procedures and its denotational semantics, we 
show that the specification of contracts of procedures in Hoare logic, and their 
procedure-modular verification, can be cast in the framework of our abstract 
contract theory. We also show that our contract theory is an instance of the 
meta-theory of Benveniste et al. With this we expect to contribute to the bridg- 
ing of the gap mentioned above, and to give a formal justification of the design 
methodology supported by the meta-theory, when applied to the software com- 
ponents of embedded systems. Several existing contract theories have already 
been shown to instantiate the meta-theory. In providing a contract theory for 
procedural programs that also instantiates it, we increase the value of the meta- 
theory by providing further evidence for its universality. In addition, we prepare 
the theoretical ground for combining our instantiation with other instantiations, 
which may target components not to be implemented in software. 

Our theoretical development should be seen as a proof-of-concept. In future 
work it will need to be extended to cover more programming language features, 
such as object orientation, multi-threading, and exceptions. 


Related Work. Software contracts and operations on contracts have long been 
an area of intensive research, as evidenced, e.g., by [1]. We briefly mention some 
works related to our theory, in addition to the already mentioned ones. 

Reasoning from multiple Hoare triples is studied in [21], in the context of un- 
available source code, where new properties cannot be derived by re-verification. 
In particular, it is found that two Hoare-style rules, the standard rule of conse- 
quence and a generalised normalisation rule, are sufficient to infer, from a set of 
existing contracts for a procedure, any contract that is semantically entailed. 

Often-changing source code is a problem for contract-based reasoning and 
contract reuse. In [13], abstract method calls are introduced to alleviate this 
problem. Fully abstract contracts are then introduced in [7], allowing reasoning 
about software to be decoupled from contract applicability checks, in a way that 
not all verification effort is invalidated by changes in a specification. 

The relation between behavioural specifications and assume/guarantee-style 
contracts for modal transition systems is studied in [2], which shows how to build 
a contract framework from any specification theory supporting composition and 
refinement. This work is built on in [9], where a formal contract framework based 
on temporal logic is presented, allowing verification of correctness of contract 
refinement relative to a specific decomposition. 

A survey of behavioural specification languages [14] found that existing lan- 
guages are well-suited for expressing properties of software components, but it 
is a challenge to express how components interact, making it difficult to reason 
about system and architectural level properties from detailed design specifica- 
tions. This provides additional evidence for the gap between contracts used in 
software verification and contracts as used in system design. 


Structure. The paper is organised as follows. Section 2 recalls the concept of con- 
tract based design and the contract meta-theory considered in the present paper. 
In Section 3 we present a denotational semantics for programs with procedures, 
including a semantics for contracts for use in procedure-modular verification. 
Next, Section 4 presents our abstract contract theory for sequential programs 
with procedures. Then, we show in Section 5 that our contract theory fulfils the 
axioms of the meta-theory, while in Section 6 we show how the specification of 
contracts of procedures in Hoare logic and their procedure-modular verification 
can be cast in the framework of our abstract contract theory. We conclude with 
Section 7. 


2 Contract Based Design 


This section describes the concept of contract based design, and motivates its use 
in cyber-physical systems development. We then recall the contract meta-theory 
by Benveniste et al. [5]. 


2.1 Contract Based Design of Cyber-Physical Systems 


Contract based design is an approach to systems design, where the system is 
developed in a top-down manner through the use of contracts for components, 
which are incrementally assembled so that they preserve the desired system-wide 
properties. Contracts are typically described by a set of assumptions the com- 
ponent makes on its environment, and a set of guarantees on the component’s 
behaviour, given that it operates in an environment adhering to the assump- 
tions [5]. 

Present-day cyber-physical systems, such as those found in the automotive, 
avionics and other industries, are extremely complex. Products assembled by 
Original Equipment Manufacturers (OEMs) often consist of components from a 
number of different suppliers, all using their own specialised design processes, 
system architectures, development platforms, and tools. This is also true inside 
the OEMs, where there are different teams with different viewpoints of the sys- 
tem, and their own design processes and tools. In addition, the system itself 
has several different aspects that need to be managed, such as the architecture, 
safety and security requirements, functional behaviour, and so on. Thus, a rigor- 
ous design framework is called for that can solve these design-chain management 
issues. 

Contract based design addresses these challenges through the principles, at 
the specification level, of refinement and abstraction, which are processes for 
managing the design flow between different layers of abstraction, and composition 
and decomposition, which manage the flow at the same level of abstraction. 
Generally, when designing a system, at the top level of abstraction there will 
be an overall system specification (or contract). This top-level contract is then 
refined, to provide a more concrete contract for the system, and decomposed, 
in order to obtain contracts for the sub-systems, and to separate the different 
viewpoints of the system. A system design typically iterates the decomposition- 
and-refinement process, resulting in several layers of abstraction, until contracts 
are obtained that can be directly implemented, or for which implementations 
already exist. An important requirement on this methodology of hierarchical 
decomposition and refinement of contracts is that it must guarantee that when 
the low-level components implement their concrete contracts, and are combined 
to form the overall system, then the top-level, abstract, contract shall hold. 


Furthermore, a contract framework in particular needs to support indepen- 
dent development and component reuse. That is, specifications for components, 
and their operations, must allow for components and specifications to be inde- 
pendently designed and implemented, and to be used in different parts of the 
system, each with their own assumptions on how the other components, the 
environment, behave. This is achieved through the principal operations on contracts: 
refinement, composition, and conjunction. 


Refinement allows one to extract a contract at the appropriate level of ab- 
straction. A desired property of refinement is that components which have been 
designed with reference to the more abstract (i.e., weaker) contract do not need 
to be re-designed after the refinement step. That is, in the early stages of devel- 
opment an OEM may have provided a weak contract for some subsystem to an 
external supplier, which implemented a component relying on this contract. As 
development of the system progresses, and the contract is refined, the compo- 
nent supplied externally should still operate according to its guarantees without 
needing to be changed, when instead assuming the new, refined, contract. 


Composition enables one to combine contracts of different components into 
a contract for the larger subsystem obtained when combining the components. 
Again, a desirable property is that other components relying on one or more 
of the individual contracts can, after composition of the contracts, assume the 
new contract and still fulfil their guarantees, without being re-designed, thus 
ensuring that subsystems can be independently implemented. 


Finally, contract conjunction is another way of combining contracts, but now 
for the different viewpoints of a single component. This allows one to separate a 
contract into several different, finer contracts for the same component, revealing 
just enough information for each particular system that depends on it, so that 
it can be reused in different parts of the system, or in entirely different systems. 


2.2 A Contract Meta-Theory 


We consider the meta-theory described in [5]. The stated purpose of the meta- 
theory has been to distil the notion of a contract to its essence, so that it 
can be used in system design methodologies without ambiguities. In particu- 
lar, the meta-theory has been developed to give support for design-chain man- 
agement, and to allow component reuse and independent development. It has 
been shown that a number of concrete contract theories instantiate it, including 
assume/guarantee-contracts, synchronous Moore interfaces, and interface theories. 
To our knowledge, this is the only meta-theory of its purpose and scope. 

We now present the formal definitions of the concepts defined in the meta- 
theory, and the properties that they entail. The meta-theory is defined only in 
terms of semantics, and it is up to particular concrete instantiations to provide 
a syntax. 


Components. The most basic concept in the meta-theory is that of a component, 
which represents any concrete part of the system. Thus, we have an abstract 
component universe $\mathbb{M}$ with components $m \in \mathbb{M}$. Over pairs 
of components, we have a composition operation $\times$. This operation is 
partially defined, and two components $m_1$ and $m_2$ are called composable 
when $m_1 \times m_2$ is defined. In such cases, we call $m_1$ an environment 
for $m_2$, and vice versa. In addition, component composition must be both 
commutative and associative, in order to ensure that different components can 
be combined in any order. 

Typically, components are open, in the sense that they contain functionality 
provided by other components, i.e., their environment. The environment in which 
a component is to be placed is often unknown at development time, and although 
a component cannot restrict it, it is designed for a certain context. 


Contracts. In the meta-theory, the notion of contract is defined in terms of 
sets of components. The contract universe 
$\mathbb{C} \stackrel{\text{def}}{=} 2^{\mathbb{M}} \times 2^{\mathbb{M}}$ 
consists of contracts $C = (E, M)$, where $E$ and $M$ are the sets of 
environments and implementations of $C$, respectively. Importantly, each pair 
$(m_1, m_2) \in E \times M$ must be composable. 
This definition is intentionally abstract. The intuition is that contracts separate 
the responsibilities of a component from the expectations on its environment. 
Moreover, contracts are best seen as weak specifications of components: they 
should expose just enough information to be adequate for their purpose. 

For a component $m$ and a contract $C = (E, M)$, we shall sometimes write 
$m \models^E C$ for $m \in E$, and $m \models^M C$ for $m \in M$. A contract 
$C$ is said to be consistent if it has at least one implementation, and 
compatible if it has at least one environment. 
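
As an illustration of these definitions, the sketch below models contracts over a small finite component universe. The component model is an assumption made only for this example: a component is a set of resource names, and two components are composable when they share no resource. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumed toy model: a component is the set of resource names it owns.
Component = FrozenSet[str]

def composable(m1: Component, m2: Component) -> bool:
    """Two components compose only if they own disjoint resources."""
    return m1.isdisjoint(m2)

@dataclass(frozen=True)
class Contract:
    envs: FrozenSet[Component]   # E: the admissible environments
    impls: FrozenSet[Component]  # M: the admissible implementations

    def is_well_formed(self) -> bool:
        """Every pair in E x M must be composable."""
        return all(composable(e, m) for e in self.envs for m in self.impls)

    def is_consistent(self) -> bool:
        """Consistent: at least one implementation."""
        return len(self.impls) > 0

    def is_compatible(self) -> bool:
        """Compatible: at least one environment."""
        return len(self.envs) > 0
```

A contract with environment `{"sensor"}` and implementation `{"controller"}` is well formed, consistent, and compatible in this model, while a contract whose only implementation shares a resource with one of its environments is rejected by `is_well_formed`.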


Contract refinement. For two contracts $C_1 = (E_1, M_1)$ and $C_2 = (E_2, M_2)$, 
$C_1$ is said to refine $C_2$, denoted $C_1 \preceq C_2$, if $M_1 \subseteq M_2$ 
and $E_2 \subseteq E_1$. As an axiom of the meta-theory, it is required that the 
greatest lower bound with respect to refinement exists, for all subsets of 
$\mathbb{C}$. Table 1 summarises the important properties of refinement and the 
other operations on contracts that a concrete contract theory needs to possess 
in order to be considered an instance of the meta-theory. 

Table 1. Properties that hold in theories that adhere to the meta-theory. 

# | Property 
1 | Refinement. When $C_1 \preceq C_2$, every implementation of $C_1$ is also an 
  | implementation of $C_2$. 
2 | Shared refinement. Any contract refining $C_1 \wedge C_2$ also refines $C_1$ 
  | and $C_2$. Any implementation of $C_1 \wedge C_2$ is a shared implementation 
  | of $C_1$ and $C_2$. Any environment for $C_1$ and $C_2$ is an environment for 
  | $C_1 \wedge C_2$. 
3 | Independent implementability. Compatible contracts can be independently 
  | implemented. 
4 | Independent refinement. For all contracts $C_i$ and $C'_i$, $i \in I$, if 
  | $C_i$, $i \in I$ are compatible and $C'_i \preceq C_i$, $i \in I$ hold, then 
  | $C'_i$, $i \in I$ are compatible and 
  | $\bigotimes_{i \in I} C'_i \preceq \bigotimes_{i \in I} C_i$. 
5 | Commutativity, sub-associativity. For any finite sets of contracts $C_i$, 
  | $i = 1, \ldots, n$, $C_1 \otimes C_2 = C_2 \otimes C_1$ and 
  | $\bigotimes_{1 \le i \le n} C_i \preceq (\bigotimes_{1 \le i < n} C_i) \otimes C_n$ 
  | holds. 
6 | Sub-distributivity. The following holds, if all contract compositions in the 
  | formula are well defined: 
  | $((C_{11} \wedge C_{21}) \otimes (C_{12} \wedge C_{22})) \preceq ((C_{11} \otimes C_{12}) \wedge (C_{21} \otimes C_{22}))$ 


Contract conjunction. The conjunction of two contracts $\mathcal{C}_1$ and
$\mathcal{C}_2$, denoted $\mathcal{C}_1 \wedge \mathcal{C}_2$, is defined as
their greatest lower bound w.r.t. the refinement order. (The intention is that
$(E_1, M_1) \wedge (E_2, M_2)$ should equal $(E_1 \cup E_2, M_1 \cap M_2)$;
however, this cannot be taken as the definition since not every such pair
necessarily constitutes a contract.) Then, we have the three desirable
properties of conjunction listed in Table 1, which together are referred to as
shared refinement.
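
The set-based definitions above can be illustrated with a minimal sketch over a finite universe of component names. The `Contract` class, the helper names, and the example components are hypothetical and not part of the meta-theory. Composability of components is ignored here, so `candidate_conjunction` only computes the pair that the text describes as the intended result, which need not always be an admissible contract.

```python
from typing import FrozenSet, NamedTuple


class Contract(NamedTuple):
    """A meta-theory contract (E, M) over component names (illustrative model)."""
    envs: FrozenSet[str]   # E: admissible environments
    impls: FrozenSet[str]  # M: admissible implementations


def refines(c1: Contract, c2: Contract) -> bool:
    """c1 refines c2: fewer implementations and more environments."""
    return c1.impls <= c2.impls and c2.envs <= c1.envs


def is_consistent(c: Contract) -> bool:
    """A contract is consistent if it has at least one implementation."""
    return bool(c.impls)


def is_compatible(c: Contract) -> bool:
    """A contract is compatible if it has at least one environment."""
    return bool(c.envs)


def candidate_conjunction(c1: Contract, c2: Contract) -> Contract:
    """The intended conjunction (E1 | E2, M1 & M2), without a composability check."""
    return Contract(c1.envs | c2.envs, c1.impls & c2.impls)


c1 = Contract(frozenset({"e1"}), frozenset({"m1", "m2"}))
c2 = Contract(frozenset({"e2"}), frozenset({"m2", "m3"}))
both = candidate_conjunction(c1, c2)
```

In this toy universe, `both` refines each of `c1` and `c2`, which is the first of the shared refinement properties.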


Contract composition. The composition of two contracts
$\mathcal{C}_1 = (E_1, M_1)$ and $\mathcal{C}_2 = (E_2, M_2)$, denoted
$\mathcal{C}_1 \otimes \mathcal{C}_2 = (E, M)$, is defined when every two
components $m_1 \in M_1$ and $m_2 \in M_2$ are composable, and must then be the
least contract, w.r.t. the refinement order, satisfying the following
conditions:


(i) $m_1 \in M_1 \wedge m_2 \in M_2 \Rightarrow m_1 \times m_2 \in M$;
(ii) $e \in E \wedge m_1 \in M_1 \Rightarrow m_1 \times e \in E_2$; and
(iii) $e \in E \wedge m_2 \in M_2 \Rightarrow e \times m_2 \in E_1$.


If all of the above is satisfied, then properties 3-6 of Table 1 hold. The
intention is that composing two components implementing $\mathcal{C}_1$ and
$\mathcal{C}_2$ should yield an implementation of
$\mathcal{C}_1 \otimes \mathcal{C}_2$, and composing an environment of
$\mathcal{C}_1 \otimes \mathcal{C}_2$ with an implementation of $\mathcal{C}_1$
should result in a valid environment for $\mathcal{C}_2$, and vice versa.
This is important in order to enable independent development. 


3 Denotational Semantics of Programs and Contracts 


In this section we summarise the background needed to understand the formal
developments later in the paper. First, we recall the standard denotational
semantics of programs with procedures on a typical toy programming language.
Next, we summarise Hoare logic and contracts, and provide a semantic
justification of procedure-modular verification, also based on denotational
semantics.


3.1 The Denotational Semantics of Programs with Procedures 


This section sketches the standard presentation of denotational semantics for 
procedural languages, as presented in textbooks such as [23,19]. This semantics 
is the inspiration for the definition of components in our abstract contract theory 
in Section 4.1. We start with a simplistic programming language not involving 
procedures, and add procedures later to the language. 

The following toy sequential programming language is typically used to 
present the denotational semantics of imperative languages: 


S ::= skip | x := a | S1; S2 | if b then S1 else S2 | while b do S


where S ranges over statements, a over arithmetic expressions, and b over Boolean 
expressions. 

To define the denotational semantics of the language, we define the set
$\mathit{State}$ of program states. A state $s \in \mathit{State}$ is a mapping
from the program variables to, for simplicity, the set of integers.

The denotation of a statement $S$, denoted $[\![S]\!]$, is typically given as a
partial function $\mathit{State} \rightharpoonup \mathit{State}$ such that
$[\![S]\!](s) = s'$ whenever executing statement $S$ from the initial state $s$
terminates in state $s'$. In case executing $S$ from $s$ does not terminate,
the value of $[\![S]\!](s)$ is undefined. The definition of $[\![S]\!]$
proceeds by induction on the structure of $S$. For example, the meaning of
sequential composition of statements is usually captured with relation
composition, as given by the equation
$[\![S_1; S_2]\!] \stackrel{\text{def}}{=} [\![S_1]\!] \circ [\![S_2]\!]$.
For the treatment of the remaining statements of the language, the reader is
referred to [23,19].
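
As a small illustration of denotations as relations, the following sketch enumerates a finite state space and composes two assignments. The two-variable state `(n, r)`, the bound on `n`, and the names `compose`, `assign_r1` and `assign_n0` are assumptions made for the example. `compose` runs its first argument first, matching the left-to-right reading of the equation above.

```python
from itertools import product

# States are pairs (n, r); a small finite universe keeps relations enumerable.
STATES = [(n, r) for n, r in product(range(3), range(2))]


def compose(rel1, rel2):
    """Relational composition: a step of rel1 followed by a step of rel2."""
    return {(s, u) for (s, t1) in rel1 for (t2, u) in rel2 if t1 == t2}


# Denotations of the assignments `r := 1` and `n := 0` as sets of state pairs.
assign_r1 = {((n, r), (n, 1)) for (n, r) in STATES}
assign_n0 = {((n, r), (0, r)) for (n, r) in STATES}

# Denotation of the sequential composition `r := 1; n := 0`.
seq = compose(assign_r1, assign_n0)
```

Every initial state is related to the single final state `(0, 1)`, so the composed relation is again a (total) function on this state space.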

The definition of denotation captures through its type (as a partial function)
that the execution of statements is deterministic. For non-deterministic
programs, the type of denotations is relaxed to
$[\![S]\!] \subseteq \mathit{State} \times \mathit{State}$; then,
$(s, s') \in [\![S]\!]$ captures that there is an execution of $S$ starting in
$s$ that terminates in $s'$. For technical reasons that will become clear
below, we shall use this latter denotation type in our treatment.

Note that we could alternatively have chosen $\mathit{State}^*$ as the
denotational domain, and most results would still hold in the context of
finite-trace semantics. However, we chose to develop the theory with a focus on
Hoare logic and deductive verification. In fact, the domain
$\mathit{State} \times \mathit{State}$ can be seen as a special case of finite
traces. In future work, we will also investigate concrete contract languages
based on this semantics, and extend the theory for that context.


Procedures and Procedure Calls. To extend the language and its denotational
semantics with procedures and procedure calls, we follow again the approach
of [23], but adapt it to an "open" setting, where some called procedures might
not be declared. We consider programs in the context of a finite set
$\mathcal{P}$ of procedure names (of some larger, "closed" program), and a set
of procedure declarations of the form proc p is $S_p$, where
$p \in \mathcal{P}$. Further, we extend the toy programming language with the
statement call p.


Listing 1.1. An even-odd toy program. 


proc even is if n = 0 then r := 1 else (n := n - 1; call odd);
proc odd is if n = 0 then r := 0 else (n := n - 1; call even)


As an example, Listing 1.1 shows a (closed) program in the toy language, 
implementing two mutually recursive procedures. The procedures check whether 
the value of the global variable n is even or odd, respectively, and assign the 
corresponding truth value to the variable r. 

Due to the (potential) recursion in the procedure declarations, the denotation
of call p, and thus of the whole language, cannot be defined by structural
induction as directly as before. We therefore define, for any set
$P \subseteq \mathcal{P}$ of procedure names, the set
$\mathit{Env}_P \stackrel{\text{def}}{=} P \to 2^{\mathit{State} \times \mathit{State}}$
of procedure environments, each environment $\rho \in \mathit{Env}_P$ thus
providing a denotation for each procedure in $P$. Let
$\mathit{Env} \stackrel{\text{def}}{=} \bigcup_{P \subseteq \mathcal{P}} \mathit{Env}_P$
be the set of all procedure environments. We define a partial order relation
$\sqsubseteq$ on procedure environments, as follows. For any two procedure
environments $\rho \in \mathit{Env}_P$ and $\rho' \in \mathit{Env}_{P'}$,
$\rho \sqsubseteq \rho'$ if and only if $P \subseteq P'$ and
$\forall p \in P.\ \rho(p) \subseteq \rho'(p)$.

Recall that a complete lattice is a partial order, every set of elements of
which has a greatest lower bound (glb) within the domain of the lattice (see,
e.g., [23]). It is easy to show that for any $P \subseteq \mathcal{P}$,
$(\mathit{Env}_P, \sqsubseteq)$ is a complete lattice, since a greatest lower
bound will exist within $\mathit{Env}_P$. Then, the least upper bound (lub)
$\rho_1 \sqcup \rho_2$ of any two procedure environments
$\rho_1 \in \mathit{Env}_{P_1}$ and $\rho_2 \in \mathit{Env}_{P_2}$ also exists,
and is the environment $\rho \in \mathit{Env}_{P_1 \cup P_2}$ such that
$\forall p \in P_1 \cup P_2.\ \rho(p) = \rho_1(p) \cup \rho_2(p)$.
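
The order and the least upper bound on procedure environments can be sketched with dictionaries that map procedure names to sets of state pairs. The names `leq` and `lub` and the example environments are hypothetical. The sketch also assumes that a procedure missing from one of the two environments contributes the empty relation to the union, a detail the text leaves implicit.

```python
def leq(rho1, rho2):
    """rho1 is below rho2: smaller domain and pointwise-included relations."""
    return set(rho1) <= set(rho2) and all(rho1[p] <= rho2[p] for p in rho1)


def lub(rho1, rho2):
    """Least upper bound: union of domains, pointwise union of relations."""
    names = set(rho1) | set(rho2)
    # Assumption: a name absent from one environment contributes the empty set.
    return {p: rho1.get(p, set()) | rho2.get(p, set()) for p in names}


s0, s1 = (0, 0), (0, 1)  # two example states (n, r)
rho_a = {"even": {(s0, s1)}}
rho_b = {"even": {(s1, s1)}, "odd": {(s0, s0)}}
joined = lub(rho_a, rho_b)
```

Both inputs lie below `joined`, while `rho_b` is not below `rho_a` because its domain is larger.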


We will sometimes need a procedure environment that maps every procedure
in $P$ to $\mathit{State} \times \mathit{State}$, and we shall denote this
environment by $\rho^{\top}_P$.

Next, for sets of procedures, we shall need the notion of interface, which is
a pair $(P^-, P^+)$ of disjoint sets of procedure names, where
$P^+ \subseteq \mathcal{P}$ is a set of provided (or declared) procedures, and
$P^- \subseteq \mathcal{P}$ a set of required (or called, but not declared)
ones.

Then, we (re)define the notion of denotation of statements $S$ in the context
of a given interface $(P^-, P^+)$ and environments
$\rho^- \in \mathit{Env}_{P^-}$ and $\rho^+ \in \mathit{Env}_{P^+}$, and denote
it by $[\![S]\!]^{\rho^+}_{\rho^-}$. In particular, we define
$[\![\text{call } p]\!]^{\rho^+}_{\rho^-}$ as $\rho^-(p)$ when $p \in P^-$ and
as $\rho^+(p)$ when $p \in P^+$.

Intuitively, the denotation of a call to a procedure should be equal to the
denotation of the body of the latter. We therefore introduce, given an
environment $\rho^- \in \mathit{Env}_{P^-}$, the function
$\xi : \mathit{Env}_{P^+} \to \mathit{Env}_{P^+}$ defined by
$\xi(\rho^+)(p) \stackrel{\text{def}}{=} [\![S_p]\!]^{\rho^+}_{\rho^-}$
for any $\rho^+ \in \mathit{Env}_{P^+}$ and $p \in P^+$, and consider its fixed
points. By the Knaster-Tarski Fixed-Point Theorem (as stated, e.g., in [23]),
since $(\mathit{Env}_{P^+}, \sqsubseteq)$ is a complete lattice and $\xi$ is
monotonic, $\xi$ has a least fixed-point $\rho^+_0$.

Finally, we define the notion of standard denotation of statement $S$ in the
context of a given interface $(P^-, P^+)$ and environment
$\rho^- \in \mathit{Env}_{P^-}$, denoted $[\![S]\!]_{\rho^-}$, by
$[\![S]\!]_{\rho^-} \stackrel{\text{def}}{=} [\![S]\!]^{\rho^+_0}_{\rho^-}$,
where $\rho^+_0$ is the least fixed-point defined above.

For example, for the closed program in Listing 1.1, we have an interface with
$P^+ = \{even, odd\}$ and $P^- = \emptyset$. Then,
$(s, s') \in [\![S_{even}]\!]^{\rho^+}_{\rho^-}$ if either $s(n) = 0$ and
$s' = s[r \mapsto 1]$, or else if $s(n) > 0$ and
$(s[n \mapsto s(n) - 1], s') \in \rho^+(odd)$. The denotation
$[\![S_{odd}]\!]^{\rho^+}_{\rho^-}$ is analogous. The resulting least
fixed-point $\rho^+_0$ is such that $(s, s') \in [\![S_{even}]\!]_{\rho^-}$, or
equivalently $(s, s') \in [\![S_{even}]\!]^{\rho^+_0}_{\rho^-}$, whenever
$s(n) \ge 0$, and either $s(n)$ is even and then $s'(n) = 0$ and $s'(r) = 1$,
or else $s(n)$ is odd and then $s'(n) = 0$ and $s'(r) = 0$. The standard
denotation $[\![S_{odd}]\!]_{\rho^-}$ of odd is analogous.
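
The least fixed-point for the even-odd program can be computed by plain iteration from the empty environment once the state space is made finite. The sketch below assumes states `(n, r)` with `0 <= n <= N`; the bound `N` and the names `body`, `xi` and `least_fixed_point` are illustrative choices. Iteration from the bottom element reaches the least fixed-point here because the lattice is finite and `xi` is monotonic.

```python
N = 4  # assumed bound on n, which keeps the state space finite
STATES = [(n, r) for n in range(N + 1) for r in (0, 1)]


def body(base_result, callee, rho):
    """Denotation of `if n = 0 then r := base_result else (n := n - 1; call callee)`."""
    out = set()
    for (n, r) in STATES:
        if n == 0:
            out.add(((n, r), (n, base_result)))
        else:
            # n := n - 1, followed by whatever the callee's denotation allows.
            out |= {((n, r), t) for (s, t) in rho[callee] if s == (n - 1, r)}
    return out


def xi(rho):
    """One unfolding of both procedure bodies under the environment rho."""
    return {"even": body(1, "odd", rho), "odd": body(0, "even", rho)}


def least_fixed_point(f, bottom):
    """Iterate f from bottom until the environment no longer changes."""
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt


rho0 = least_fixed_point(xi, {"even": set(), "odd": set()})
```

On this bounded state space, `rho0["even"]` relates every state to the final state with `n = 0` and `r` equal to 1 exactly when the initial `n` is even.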


3.2 Hoare Logic and Contracts 


In this section we summarise the denotational semantics of Hoare logic and 
the semantic justification of procedure-modular verification, as developed by the 
second author in [12]. These formalisations serve as the starting point for the 
definition of contracts in our contract theory developed in Section 4.2. 


Hoare Logic. The basic judgement of Hoare logic [15] is the Hoare triple,
written $\{P\}\,S\,\{Q\}$, where $P$ and $Q$ are assertions over the program
state, and $S$ is a program statement. The Hoare triple signifies that if the
statement $S$ is executed from a state that satisfies $P$ (called the
pre-condition), and if this execution terminates, then the final state of the
execution will satisfy $Q$ (called the post-condition). Additionally, so-called
logical variables can be used within a Hoare triple, to specify the desired
relationship between the values of variables after execution and the values of
variables before execution. The values of the program variables are defined by
the notion of state; to give a meaning to the logical variables we shall use
interpretations $\mathcal{I}$. We shall write $s \models_{\mathcal{I}} P$ to
signify that the assertion $P$ is true w.r.t. state $s$ and interpretation
$\mathcal{I}$. The formal validity of a Hoare triple is denoted by
$\models_{par} \{P\}\,S\,\{Q\}$, where the subscript signifies that validity is
in terms of partial correctness, where termination of the execution of $S$ is
not required.

An example of a Hoare triple, stating the desired behaviour of procedure odd
from Listing 1.1, is shown below, where we use the logical variable $n_0$ to
capture the value of $n$ prior to execution of odd:


$\{n \ge 0 \wedge n = n_0\}\ S_{odd}\ \{(n_0 \bmod 2 = 0 \Rightarrow r = 0) \wedge (n_0 \bmod 2 = 1 \Rightarrow r = 1)\}$ (1)


Procedure even is specified analogously. 

Hoare logic comes with a proof calculus for reasoning in terms of Hoare
triples, consisting of proof rules for the different types of statements of the
programming language. An example is the rule for sequential composition:


$$\frac{\{P\}\ S_1\ \{R\} \qquad \{R\}\ S_2\ \{Q\}}{\{P\}\ S_1; S_2\ \{Q\}}\ \textsc{Composition}$$


which essentially states that if executing $S_1$ from any state satisfying $P$
terminates (if at all) in some state satisfying $R$, and executing $S_2$ from
any state satisfying $R$ terminates (if at all) in some state satisfying $Q$,
then it is the case that executing the composition $S_1; S_2$ from any state
satisfying $P$ terminates (if at all) in some state satisfying $Q$. The proof
system is sound and relatively complete w.r.t. the denotational semantics of
the programming language (see, e.g., [23,19]).


Hoare Logic Contracts. One can view a Hoare triple {P}S{Q} as a contract 
C = (P,Q) imposed on the program S. In many contexts it is meaningful to 
separate the contract from the program; for instance, if the program is yet to 
be implemented. In our earlier work [12], we gave such contracts a denotational 
semantics as follows: 


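$[\![C]\!] \stackrel{\text{def}}{=} \{(s, s') \mid \forall \mathcal{I}.\ (s \models_{\mathcal{I}} P \Rightarrow s' \models_{\mathcal{I}} Q)\}$ (2)


The rationale behind this definition is the following desirable property: a
program meets a contract whenever its denotation is subsumed by the denotation
of the contract, i.e., $S \models_{par} C$ if and only if
$[\![S]\!] \subseteq [\![C]\!]$.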

For example, for the contract $C_{odd}$ induced by (1) we have that
$(s, s') \in [\![C_{odd}]\!]$ if and only if either $s(n) < 0$, or else
$s'(r) = 0$ if $s(n)$ is even and $s'(r) = 1$ if $s(n)$ is odd. The denotation
of $C_{even}$ is analogous.
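
The denotation of a Hoare logic contract can be enumerated on a small state space by treating an interpretation as a value for the single logical variable `n0`. The sketch below is an illustration under those assumptions: the finite ranges and the names `pre`, `post_odd`, `contract_denotation`, `meets` and `parity` are hypothetical, and `parity` is a hand-written relation rather than the denotation of a program from the paper.

```python
STATES = [(n, r) for n in range(-1, 4) for r in (0, 1)]
N0_VALUES = range(-1, 4)  # finite stand-in for all interpretations of n0


def pre(s, n0):
    """Pre-condition of the contract for odd: n >= 0 and n = n0."""
    n, _ = s
    return n >= 0 and n == n0


def post_odd(s, n0):
    """Post-condition: r = 0 if n0 is even, r = 1 if n0 is odd."""
    _, r = s
    return (n0 % 2 != 0 or r == 0) and (n0 % 2 != 1 or r == 1)


def contract_denotation(pre, post):
    """All pairs (s, t) such that every interpretation satisfying pre in s satisfies post in t."""
    return {(s, t) for s in STATES for t in STATES
            if all(not pre(s, n0) or post(t, n0) for n0 in N0_VALUES)}


def meets(denotation, contract):
    """A denotation meets a contract when it is subsumed by the contract's denotation."""
    return denotation <= contract


c_odd = contract_denotation(pre, post_odd)
# A hand-written relation that sets r to the parity of n and n to 0.
parity = {((n, r), (0, n % 2)) for (n, r) in STATES if n >= 0}
```

States with negative `n` are related to every final state, since no interpretation satisfies the pre-condition there.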


The Denotational Semantics of Programs with Procedure Contracts. Let $S$ be a
program with procedures, and let every declared procedure $p \in P$ be equipped
with a procedure contract $C_p$. Procedure-modular verification refers to
techniques that verify every procedure in isolation. The key to this is to
handle procedure calls by using the contract of the called procedure rather
than its body (i.e., by contracting rather than by inlining [7]). In [12], a
semantic justification of this is given by means of a contract-relative
denotational semantics of statements.
The intuition behind this semantics is that procedure calls are given a meaning 
through the denotation of the contract of the called procedure, rather than 
through the denotation of its body. 

The contract-relative denotational semantics of a statement $S$, denoted
$[\![S]\!]^{cr}$, is defined with the help of the contract environment $\rho_c$
that is induced by the procedure contracts, i.e.,
$\rho_c(p) \stackrel{\text{def}}{=} [\![C_p]\!]$ for all $p \in P$, as
$[\![S]\!]^{cr} \stackrel{\text{def}}{=} [\![S]\!]^{\rho_c}_{\rho^-}$. Notice
that this definition does not involve solving any recursive equations (i.e.,
finding fixed points), and gives rise to a contract-relative notion of when a
statement meets a contract, namely $S \models^{cr}_{par} C$ if and only if
$[\![S]\!]^{cr} \subseteq [\![C]\!]$. This is exactly the correctness notion
that is the target of procedure-modular verification. As shown in [12], this
notion is sound w.r.t. the original notion $S \models_{par} C$, in the sense
that $S \models^{cr}_{par} C$ entails $S \models_{par} C$. In other words,
verifying a program procedure-modularly establishes that the program is correct
w.r.t. its contract in the standard sense.

For example, the contract-relative semantics of $S_{even}$ is such that
$(s, s') \in [\![S_{even}]\!]^{cr}$ if either $s(n) < 0$, or $s(n) = 0$ and
$s' = s[r \mapsto 1]$, or else $s'(r) = 1$ if $s(n)$ is even and $s'(r) = 0$ if
$s(n)$ is odd. The contract-relative semantics of $S_{odd}$ is analogous. Then,
it is easy to check that both $S_{even} \models^{cr}_{par} C_{even}$ and
$S_{odd} \models^{cr}_{par} C_{odd}$ hold.
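
The contract-relative check for the even-odd example can be replayed on a bounded state space, interpreting each call through the contract of the callee instead of through a fixed point. The sketch assumes states `(n, r)` with `0 <= n <= N`, so the pre-condition `n >= 0` always holds; the names `contract`, `body`, `RHO_C`, `cr_even` and `cr_odd` are illustrative.

```python
N = 4  # assumed bound on n; all states satisfy the pre-condition n >= 0
STATES = [(n, r) for n in range(N + 1) for r in (0, 1)]


def contract(result_for_parity):
    """Contract denotation: the final r is determined by the parity of the initial n."""
    return {(s, t) for s in STATES for t in STATES
            if t[1] == result_for_parity[s[0] % 2]}


C_EVEN = contract({0: 1, 1: 0})  # r = 1 for even n, r = 0 for odd n
C_ODD = contract({0: 0, 1: 1})   # r = 0 for even n, r = 1 for odd n
RHO_C = {"even": C_EVEN, "odd": C_ODD}  # contract environment


def body(base_result, callee, rho):
    """Denotation of `if n = 0 then r := base_result else (n := n - 1; call callee)`."""
    out = set()
    for (n, r) in STATES:
        if n == 0:
            out.add(((n, r), (n, base_result)))
        else:
            out |= {((n, r), t) for (s, t) in rho[callee] if s == (n - 1, r)}
    return out


# Contract-relative semantics: calls are given meaning by RHO_C, with no iteration.
cr_even = body(1, "odd", RHO_C)
cr_odd = body(0, "even", RHO_C)
```

Both inclusions `cr_even <= C_EVEN` and `cr_odd <= C_ODD` hold in this model, mirroring the two judgements stated above.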


4 An Abstract Contract Theory 


This section presents an abstract contract theory for programs with procedures. 
The theory builds on the basic notion of denotation as a binary relation over 
states. As we will show later, it is both an abstraction of the denotational se- 
mantic view on programs with procedures and procedure contracts presented 
in Sections 3.1 and 3.2, and an instantiation of the meta-theory described in 
Section 2.2. 


4.1 Components 


In the context of a concrete programming language, we view a component as a 
module, consisting of a collection of procedures that are provided by the module. 
The module may call required procedures that are external to the module. The 
way the provided procedures transform the program state upon a call depends 
on how the required procedures transform the state. We take this observation 
as the basis of our abstract setting, in which state transformers are modelled 
as denotations (i.e., as binary relations over states). A component will thus be 
simply a mapping from denotations of the required procedures to denotations of 
the provided ones, both captured through the notion of procedure environments. 

The contract theory is abstract, in that it is not defined for a particular 
programming language, and may be instantiated with any procedural language. 
As with the meta-theory, the abstract contract theory is also defined only on the 
semantic level. 

Recall the notions and notation from Section 3.1. A component interface
$I = (P^-, P^+)$ is a pair of disjoint, finite sets of procedure names, of the
required and the provided ones, respectively.


Definition 1 (Component). A component $m$ with interface
$I_m = (P^-_m, P^+_m)$ is a mapping
$m : \mathit{Env}_{P^-_m} \to \mathit{Env}_{P^+_m}$.


Let $\mathbf{M}$ denote the universe of all components over $\mathcal{P}$.

We assume that any system is built up from a set of base components, the 
simplest components from which more complex components are then obtained 
by composition. The base components must be monotonic functions over the 
lattice defined in Section 3.1. 

When $P^-_m = \emptyset$, we shall identify $m$ with an element of
$\mathit{Env}_{P^+_m}$. In other words, when a component is closed, i.e., is
not dependent on any external procedures, the provided environment is constant.


Definition 2 (Component composability). Two components $m_1$ and $m_2$
are composable iff $P^+_{m_1} \cap P^+_{m_2} = \emptyset$.


When defining the composition of two components, particular care is required
in the treatment of procedure names that are provided by one of the components
while required by the other. Let $\mu x. f(x)$ denote the least fixed-point of
a function $f$, when it exists.


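Definition 3 (Component composition). Given two composable components
$m_1 : \mathit{Env}_{P^-_{m_1}} \to \mathit{Env}_{P^+_{m_1}}$ and
$m_2 : \mathit{Env}_{P^-_{m_2}} \to \mathit{Env}_{P^+_{m_2}}$, their
composition is defined as a mapping
$m_1 \times m_2 : \mathit{Env}_{P^-_{m_1 \times m_2}} \to \mathit{Env}_{P^+_{m_1 \times m_2}}$
such that:


$P^+_{m_1 \times m_2} \stackrel{\text{def}}{=} P^+_{m_1} \cup P^+_{m_2}$

$P^-_{m_1 \times m_2} \stackrel{\text{def}}{=} (P^-_{m_1} \cup P^-_{m_2}) \setminus (P^+_{m_1} \cup P^+_{m_2})$

$m_1 \times m_2 \stackrel{\text{def}}{=} \lambda \rho^-_{m_1 \times m_2} \in \mathit{Env}_{P^-_{m_1 \times m_2}}.\ \mu \rho.\ \chi^+_{m_1 \times m_2}(\rho)$


where
$\chi^+_{m_1 \times m_2} : \mathit{Env}_{P^+_{m_1 \times m_2}} \to \mathit{Env}_{P^+_{m_1 \times m_2}}$
is defined, in the context of a given
$\rho^-_{m_1 \times m_2} \in \mathit{Env}_{P^-_{m_1 \times m_2}}$, as follows.
Let $\rho^+_{m_1 \times m_2} \in \mathit{Env}_{P^+_{m_1 \times m_2}}$, and let
$\rho^-_{m_1} \in \mathit{Env}_{P^-_{m_1}}$ be the environment defined by:


$\rho^-_{m_1}(p) \stackrel{\text{def}}{=} \begin{cases} \rho^+_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \cap P^+_{m_2} \\ \rho^-_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \setminus P^+_{m_2} \end{cases}$


and let $\rho^-_{m_2} \in \mathit{Env}_{P^-_{m_2}}$ be defined symmetrically.
We then define:


$\chi^+_{m_1 \times m_2}(\rho^+_{m_1 \times m_2})(p) \stackrel{\text{def}}{=} \begin{cases} m_1(\rho^-_{m_1})(p) & \text{if } p \in P^+_{m_1} \\ m_2(\rho^-_{m_2})(p) & \text{if } p \in P^+_{m_2} \end{cases}$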


In the above definition, $\chi^+_{m_1 \times m_2}$ represents the denotations
of the procedure bodies of the procedures provided by the two composed
components, given denotations of procedure calls to the same procedures. The
choice of least fixed-point will be crucial for the proof of Theorem 2(i) in
Section 4.2 below.

The definition is well-defined, in the sense that the stated least fixed-points 
exist, and the resulting components are monotonic functions. 


Theorem 1. Component composition is well-defined. 


The existence of a least fixed-point follows from the Knaster-Tarski Fixed-Point 
Theorem, as stated, e.g., in [23]. It can then be shown, by structural induction, 
that composition is well-defined. For lack of space, the proofs of all theorems, 
some of which are conceptually not very involved but rather verbose, are omitted 
here. The full proofs can be found in the accompanying technical report [17]. 
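
Component composition can be prototyped with components as Python functions from environments to environments, and the least fixed-point computed by iteration. The sketch assumes a finite state space `(n, r)` with `0 <= n <= N`; the `Component` class and the helpers `parity_proc` and `compose` are hypothetical names. Inside `compose`, each component receives the denotations of its required procedures either from the external environment or from the current approximation of the provided ones, and the result is iterated from the empty environment.

```python
N = 4  # assumed bound on n, which keeps the state space finite
STATES = [(n, r) for n in range(N + 1) for r in (0, 1)]


class Component:
    """A component: required names, provided names, and an environment transformer."""
    def __init__(self, required, provided, fn):
        self.required = frozenset(required)
        self.provided = frozenset(provided)
        self.fn = fn  # maps an environment over `required` to one over `provided`


def parity_proc(name, callee, base_result):
    """Component for `proc name is if n = 0 then r := base_result else (n := n - 1; call callee)`."""
    def fn(rho_minus):
        out = set()
        for (n, r) in STATES:
            if n == 0:
                out.add(((n, r), (n, base_result)))
            else:
                out |= {((n, r), t) for (s, t) in rho_minus[callee] if s == (n - 1, r)}
        return {name: out}
    return Component({callee}, {name}, fn)


def compose(m1, m2):
    """Composition of two components with disjoint provided names."""
    assert not (m1.provided & m2.provided), "components must be composable"
    provided = m1.provided | m2.provided
    required = (m1.required | m2.required) - provided

    def fn(rho_minus):
        def chi(rho_plus):
            # Required names are resolved externally or from the other component.
            known = {**rho_minus, **rho_plus}
            result = {}
            for m in (m1, m2):
                result.update(m.fn({p: known[p] for p in m.required}))
            return result

        current = {p: set() for p in provided}  # bottom environment
        while True:
            nxt = chi(current)
            if nxt == current:
                return current
            current = nxt

    return Component(required, provided, fn)


m_even = parity_proc("even", "odd", 1)
m_odd = parity_proc("odd", "even", 0)
m = compose(m_even, m_odd)
closed = m.fn({})  # the composition is closed, so the external environment is empty
```

The composed component requires nothing and provides both procedures; applied to the empty environment it yields the denotations expected for the closed even-odd program on this bounded state space.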


4.2 Denotational Contracts 


We now define the notion of denotational contracts $c$ in the style of
assume/guarantee contracts [4,6]. Contracts shall also be given interfaces.


Definition 4 (Denotational contract). A denotational contract $c$ with
interface $I_c = (P^-_c, P^+_c)$ is a pair $(\rho^-_c, \rho^+_c)$, where
$\rho^-_c \in \mathit{Env}_{P^-_c}$ and $\rho^+_c \in \mathit{Env}_{P^+_c}$.


The intended interpretation of the environment pair is as follows: assuming
that the denotation of every called procedure $p \in P^-_c$ is subsumed by
$\rho^-_c(p)$, then it is guaranteed that the denotation of every provided
procedure $p' \in P^+_c$ is subsumed by $\rho^+_c(p')$.


Definition 5 (Contract implementation). A component $m$ with interface
$I_m = (P^-_m, P^+_m)$ is an implementation for, or implements, a contract
$c = (\rho^-_c, \rho^+_c)$ with interface $I_c = (P^-_c, P^+_c)$, denoted
$m \models c$, iff $P^-_c \subseteq P^-_m$, $P^+_m \subseteq P^+_c$,
= Ty ~ a+ 


The reason for not requiring the interfaces to be equal is that we aim at a subset 
relation between components implementing a contract and those implementing 
a refinement of said contract, in the meta-theory instantiation. 

For a mapping $h : A \to B$ and set $A' \subseteq A$, let $h|_{A'}$ denote as
usual the restriction of $h$ on $A'$.


Definition 6 (Contract environment). A component m is an environment 
for contract c iff, for any implementation m’ of c, m and m’ are composable, 
and VPp mt E Envp- (m x m')(Pmxm pt E pe. 


Intuitively, an environment of a contract c is then a component such that when 
it is composed with an implementation of c, the composition will operate satis- 
factorily with respect to the guarantee of the contract. 

We will now define the refinement relation, and the conjunction and compo- 
sition operations, on contracts. 


Definition 7 (Contract refinement). A contract $c$ refines contract $c'$,
denoted $c \preceq c'$, iff $\rho^-_{c'} \sqsubseteq \rho^-_c$ and
$\rho^+_c \sqsubseteq \rho^+_{c'}$, where $\sqsubseteq$ is the partial order
relation defined in Section 3.1.


The refinement relation reflects the intention that if a contract $c$ refines
another contract $c'$, then any component implementing $c$ should also
implement $c'$.


Definition 8 (Contract conjunction). The conjunction of two contracts
$c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ and $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$
is the contract
$c_1 \wedge c_2 \stackrel{\text{def}}{=} (\rho^-_{c_1} \sqcup \rho^-_{c_2},\ \rho^+_{c_1} \sqcap \rho^+_{c_2})$,
where $\sqcup$ and $\sqcap$ are the lub and glb operations of the lattice,
respectively.


This definition is consistent with the intention that any contract that refines
$c_1 \wedge c_2$ should also refine $c_1$ and $c_2$ individually. The interface
of $c_1 \wedge c_2$ is then
$I_{c_1 \wedge c_2} = (P^-_{c_1} \cup P^-_{c_2},\ P^+_{c_1} \cap P^+_{c_2})$.
Note that while this is the interface in general, conjunction of contracts is
typically used to merge different viewpoints of the same component, and in that
case $I_{c_1} = I_{c_2} = I_{c_1 \wedge c_2}$.
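
Refinement and conjunction of denotational contracts can be sketched on top of dictionary-based environments. The sketch assumes the assume/guarantee reading in which a refining contract has a weaker assumption (a larger required environment) and a stronger guarantee (a smaller provided environment). It also assumes that the glb of two environments is taken over the shared procedure names, in line with the interface stated above. All function names and the example contracts are hypothetical.

```python
def leq(rho1, rho2):
    """rho1 is below rho2: smaller domain and pointwise-included relations."""
    return set(rho1) <= set(rho2) and all(rho1[p] <= rho2[p] for p in rho1)


def lub(rho1, rho2):
    """Union of domains, pointwise union (missing names contribute the empty set)."""
    return {p: rho1.get(p, set()) | rho2.get(p, set()) for p in set(rho1) | set(rho2)}


def glb(rho1, rho2):
    """Intersection of domains, pointwise intersection."""
    return {p: rho1[p] & rho2[p] for p in set(rho1) & set(rho2)}


def refines(c, c_prime):
    """c refines c_prime: weaker assumption, stronger guarantee (assumed reading)."""
    (assume, guarantee), (assume_p, guarantee_p) = c, c_prime
    return leq(assume_p, assume) and leq(guarantee, guarantee_p)


def conjunction(c1, c2):
    """Conjunction: lub of the assumptions, glb of the guarantees."""
    return (lub(c1[0], c2[0]), glb(c1[1], c2[1]))


s0, s1 = (0, 0), (0, 1)  # two example states (n, r)
c1 = ({"odd": {(s0, s0)}}, {"even": {(s0, s1), (s1, s1)}})
c2 = ({"odd": {(s1, s0)}}, {"even": {(s0, s1), (s0, s0)}})
both = conjunction(c1, c2)
```

By construction `both` refines each of `c1` and `c2`, while `c1` does not refine `c2` since their assumptions are incomparable.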


Definition 9 (Contract composability). Two contracts
$c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ and $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$
with interfaces $I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ and
$I_{c_2} = (P^-_{c_2}, P^+_{c_2})$ are composable if:
(i) $P^+_{c_1} \cap P^+_{c_2} = \emptyset$,
(ii) $\forall p \in P^+_{c_1} \cap P^-_{c_2}.\ \rho^+_{c_1}(p) \subseteq \rho^-_{c_2}(p)$, and
(iii) $\forall p \in P^-_{c_1} \cap P^+_{c_2}.\ \rho^+_{c_2}(p) \subseteq \rho^-_{c_1}(p)$.


The conditions for composability ensure that the mutual guarantees of the two
contracts meet each other's assumptions.


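Definition 10 (Contract composition). The composition of two composable
contracts $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ and
$c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$, with interfaces
$I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ and $I_{c_2} = (P^-_{c_2}, P^+_{c_2})$,
respectively, is the contract
$c_1 \otimes c_2 \stackrel{\text{def}}{=} (\rho^-_{c_1 \otimes c_2},\ \rho^+_{c_1} \sqcup \rho^+_{c_2})$,
where:


$\rho^-_{c_1 \otimes c_2} \stackrel{\text{def}}{=} (\rho^-_{c_1} \sqcap \rho^-_{c_2})|_{(P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2})}$


The interface of $c_1 \otimes c_2$ is
$I_{c_1 \otimes c_2} = ((P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2}),\ P^+_{c_1} \cup P^+_{c_2})$.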


Theorem 2. For any composable contracts $c_1$ and $c_2$, and any
implementations $m_1 \models c_1$ and $m_2 \models c_2$, $m_1$ and $m_2$ are
composable, and $c_1 \otimes c_2$ is the least contract (w.r.t. refinement
order) for which the following properties hold:


(i) $m_1 \times m_2 \models c_1 \otimes c_2$,
(ii) if $m$ is an environment to $c_1 \otimes c_2$, then $m_1 \times m$ is an environment to $c_2$,
(iii) if $m$ is an environment to $c_1 \otimes c_2$, then $m \times m_2$ is an environment to $c_1$.


5 Connection to Meta-Theory 


In this section we show that the abstract contract theory presented in Section 4 
instantiates the meta-theory described in Section 2.2. 

In our instantiation of the meta-theory, we consider as the abstract component
universe $\mathbf{M}$ the same universe of components $\mathbf{M}$ as defined
in Section 4.1. To distinguish the contracts of the meta-theory from those of
the abstract theory, we shall always denote the former by $\mathcal{C}$ and the
latter by $c$. Recall that a contract $\mathcal{C}$ is a pair $(E, M)$, where
$E, M \subseteq \mathbf{M}$. The formal connection between the two notions is
established with the following definition.


Definition 11 (Induced contract). Let $c$ be a denotational contract. It
induces the contract $\mathcal{C}_c = (E_c, M_c)$, where
$E_c \stackrel{\text{def}}{=} \{m \in \mathbf{M} \mid m \text{ is an environment for } c\}$
and $M_c \stackrel{\text{def}}{=} \{m \in \mathbf{M} \mid m \models c\}$.


Since contract implementation requires that the implementing component's
provided procedures are a subset of the contract's provided procedures, every
component $m$ such that $P^+_m \cap P^+_c = \emptyset$ is composable with every
component in $M_c$.

The definitions of implementation, refinement and conjunction of denota- 
tional contracts make this straightforward definition of induced contracts possi- 
ble, so that it directly results in refinement as set membership and conjunction 
as lub w.r.t. the refinement order. 


Theorem 3. The contract theory of Section 4 instantiates the meta-theory of
Benveniste et al. [5], in the sense that composition of components is
associative and commutative, and for any two contracts $c_1$ and $c_2$:


(i) $c_1 \preceq c_2$ iff $\mathcal{C}_{c_1}$ refines $\mathcal{C}_{c_2}$ according to the definition of the meta-theory,
(ii) $\mathcal{C}_{c_1 \wedge c_2}$ is the conjunction of $\mathcal{C}_{c_1}$ and $\mathcal{C}_{c_2}$ as defined in the meta-theory, and
(iii) $\mathcal{C}_{c_1 \otimes c_2}$ is the composition of $\mathcal{C}_{c_1}$ and $\mathcal{C}_{c_2}$ as defined in the meta-theory.


The proof is straightforward, since many definitions of the contract theory are 
deliberately similar to their counterparts in the meta-theory. 

Let us now return to our example from Section 3. When applying Contract-Based
Design, contracts at the more abstract level will be decomposed into contracts
at the more concrete level. So, for our example, we might have at the top level
a contract $c = (\rho^-_c, \rho^+_c)$ with interface
$(\emptyset, \{even, odd\})$, where $\rho^-_c = \emptyset$, and where
$\rho^+_c \in \mathit{Env}_{P^+_c}$ maps even to the set of pairs $(s, s')$
such that whenever $s(n)$ is non-negative and even, then $s'(r) = 1$, and when
$s(n)$ is non-negative and odd, then $s'(r) = 0$, and maps odd in a dual
manner. This contract could then be decomposed into two contracts $c_{even}$
and $c_{odd}$, so that
$\rho^+_{c_{even}}(even) \stackrel{\text{def}}{=} \rho^+_c(even)$ and
$\rho^-_{c_{even}}(odd) \stackrel{\text{def}}{=} \rho^+_c(odd)$, and $c_{odd}$
is analogous. Then, we would have $c_{even} \otimes c_{odd} \preceq c$, and for
any two components $m_{even}$ and $m_{odd}$ such that
$m_{even} \models c_{even}$ and $m_{odd} \models c_{odd}$, it would hold that
$m_{even} \times m_{odd} \models c$.


6 Connection to Programs with Procedures 


In this section we discuss how our abstract contract theory from Section 4 relates 
to programs with procedures as presented in Section 3.1, and how it relates to 
Hoare logic and procedure-modular verification as presented in Section 3.2. 
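First, we define how to abstract the denotational notion of procedures into
components in the abstract theory, based on the function $\xi$ from
Section 3.1.


Definition 12 (From procedure sets to components). For any set of procedures
$P^+$, calling procedures $P^-$, we define the component
$m : \mathit{Env}_{P^-_m} \to \mathit{Env}_{P^+_m}$, where
$P^-_m \stackrel{\text{def}}{=} P^- \setminus P^+$ and
$P^+_m \stackrel{\text{def}}{=} P^+$, so that
$\forall \rho^-_m \in \mathit{Env}_{P^-_m}.\ \forall p \in P^+_m.\ m(\rho^-_m)(p) \stackrel{\text{def}}{=} [\![S_p]\!]_{\rho^-_m}$.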


As the next result shows, procedure set abstraction and component compo- 
sition commute. Together with commutativity and associativity of component 
composition, this means that the initial grouping of procedures into components 
is irrelevant, and that one can start with abstracting each individual procedure 
into a component. 


Theorem 4. For any two disjoint sets of procedures $P^+_1$ and $P^+_2$,
abstracted individually into components $m_1$ and $m_2$, respectively, and
$P^+_1 \cup P^+_2$ abstracted into component $m$, it holds that
$m_1 \times m_2 = m$.


The result is a direct consequence of Definition 12, Definition 3, and the
well-known Bekić's Lemma [3] about simultaneous fixed-points.


Component abstraction example. Let us illustrate the theorem on our even-odd
example (however, the example does not really illustrate Bekić's Lemma, since
the two procedures do not call themselves).

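By Definition 12, the procedure set $\{even\}$ is abstracted into component
$m_{even} : \mathit{Env}_{\{odd\}} \to \mathit{Env}_{\{even\}}$ with interface
$(\{odd\}, \{even\})$, so that
$\forall \rho^- \in \mathit{Env}_{\{odd\}}.\ m_{even}(\rho^-)(even) \stackrel{\text{def}}{=} [\![S_{even}]\!]_{\rho^-}$.
By definition, $[\![S_{even}]\!]_{\rho^-}$ is equal to
$[\![S_{even}]\!]^{\rho^+_0}_{\rho^-}$, where $\rho^+_0$ is the least fixed
point of $\xi : \mathit{Env}_{\{even\}} \to \mathit{Env}_{\{even\}}$ defined by
$\xi(\rho^+)(even) \stackrel{\text{def}}{=} [\![S_{even}]\!]^{\rho^+}_{\rho^-}$
for any $\rho^+ \in \mathit{Env}_{\{even\}}$. Notice, however, that procedure
even does not have any calls to itself, so $[\![S_{even}]\!]^{\rho^+}_{\rho^-}$
does not really depend on $\rho^+$. Then, for any
$\rho^- \in \mathit{Env}_{\{odd\}}$, $(s, s') \in m_{even}(\rho^-)(even)$ if
either $s(n) = 0$ and $s' = s[r \mapsto 1]$, or else if $s(n) > 0$ and
$(s[n \mapsto s(n) - 1], s') \in \rho^-(odd)$.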

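Similarly, the procedure set $\{odd\}$ is abstracted into component
$m_{odd} : \mathit{Env}_{\{even\}} \to \mathit{Env}_{\{odd\}}$ with interface
$(\{even\}, \{odd\})$, so that
$\forall \rho^- \in \mathit{Env}_{\{even\}}.\ m_{odd}(\rho^-)(odd) \stackrel{\text{def}}{=} [\![S_{odd}]\!]_{\rho^-}$.
Then, for any $\rho^- \in \mathit{Env}_{\{even\}}$,
$(s, s') \in m_{odd}(\rho^-)(odd)$ if either $s(n) = 0$ and
$s' = s[r \mapsto 0]$, or else if $s(n) > 0$ and
$(s[n \mapsto s(n) - 1], s') \in \rho^-(even)$.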

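Now, applying Definition 12 to the whole (closed) program yields a component
$m : \mathit{Env}_{\emptyset} \to \mathit{Env}_{\{even, odd\}}$ with interface
$(\emptyset, \{even, odd\})$, so that
$\forall \rho^- \in \mathit{Env}_{\emptyset}.\ \forall p \in \{even, odd\}.\ m(\rho^-)(p) \stackrel{\text{def}}{=} [\![S_p]\!]_{\rho^-}$.
Recall the denotations $[\![S_{even}]\!]_{\rho^-}$ and
$[\![S_{odd}]\!]_{\rho^-}$ from the end of Section 3.1.

Components $m_{even}$ and $m_{odd}$ are composable, and by Definition 3, their
composition has (the same) interface $(\emptyset, \{even, odd\})$, and is
(also) a mapping
$m_{even} \times m_{odd} : \mathit{Env}_{\emptyset} \to \mathit{Env}_{\{even, odd\}}$.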


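Finally, note that function
$\chi^+_{m_{even} \times m_{odd}} : \mathit{Env}_{\{even, odd\}} \to \mathit{Env}_{\{even, odd\}}$
is exactly the function $\xi$ in the context of the interface
$(\emptyset, \{even, odd\})$. This can be seen by first noting that since
$\mathit{Env}_{\emptyset} = \emptyset$, we have that
$\chi^+_{m_{even} \times m_{odd}}$ only depends on its arguments. Furthermore,
for all $\rho^+ \in \mathit{Env}_{\{even, odd\}}$, if
$\rho^+_{odd} \stackrel{\text{def}}{=} \rho^+|_{\{odd\}}$ and
$\rho^+_{even} \stackrel{\text{def}}{=} \rho^+|_{\{even\}}$, we have that,
since $odd \in P^-_{even} \cap P^+_{odd}$, then


$\chi^+_{m_{even} \times m_{odd}}(\rho^+)(even) = m_{even}(\rho^+_{odd})(even) = [\![S_{even}]\!]_{\rho^+_{odd}} = [\![S_{even}]\!]^{\rho^+}_{\rho^-} = \xi(\rho^+)(even)$.


Similarly
$\chi^+_{m_{even} \times m_{odd}}(\rho^+)(odd) = \xi(\rho^+)(odd)$. We
therefore have $m_{even} \times m_{odd} = m$.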


We now define how to abstract Hoare logic contracts into denotational
contracts, in terms of the contract environment $\rho_c$ defined in
Section 3.2.


Definition 13 (From Hoare logic contracts to denotational contracts).
For a procedure $p$ with Hoare logic contract $C_p$, calling other procedures
$P^-$, we define the denotational contract
$c_p = (\rho^-_{c_p}, \rho^+_{c_p})$ with interface
$P^+_{c_p} \stackrel{\text{def}}{=} \{p\}$ and
$P^-_{c_p} \stackrel{\text{def}}{=} P^-$, so that
$\rho^+_{c_p}(p) \stackrel{\text{def}}{=} \rho_c(p)$, and
$\forall p' \in P^-.\ \rho^-_{c_p}(p') \stackrel{\text{def}}{=} \rho_c(p')$.


In this way, conceptually, denotational contracts become
assume/guarantee-style specifications over Hoare logic procedure contracts:
assuming that all (external) procedures called by a procedure $p$ transform the
state according to their Hoare logic contracts, procedure $p$ obliges itself to
do so as well.

We now show that if a procedure implements a Hoare logic contract, then the
abstracted component will implement the abstracted contract, and vice versa.
Together with Theorem 4, this result allows the procedure-modular verification
of abstract components.


Theorem 5. For any procedure $p$ with procedure contract $C_p$, abstracted into
component $m_p$ with contract $c_p$, we have $S_p \models^{cr}_{par} C_p$ iff
$m_p \models c_p$.


The result follows mainly from Definitions 12 and 13, and the denotational se- 
mantics given in Section 3. 

Returning to the example from Sections 3 and 5, we can abstract the procedure
set $\{even\}$ into component $m_{even}$, with interface
$(\{odd\}, \{even\})$, which would be a function
$\mathit{Env}_{\{odd\}} \to \mathit{Env}_{\{even\}}$, and
$\forall \rho^- \in \mathit{Env}_{\{odd\}}.\ m_{even}(\rho^-)(even) = [\![S_{even}]\!]_{\rho^-}$.
The denotational contracts $c_{even}$ and $c_{odd}$ resulting from the
decomposition shown in Section 5 would be exactly the abstraction of the Hoare
logic contracts $C_{even}$ and $C_{odd}$ shown in Section 3.2. They would both
be part of the contract environment used in procedure-modular verification, for
example when verifying that $S_{even} \models^{cr}_{par} C_{even}$, which would
entail $m_{even} \models c_{even}$. Thus, by applying standard
procedure-modular verification at the source code level, we prove the top-level
contract $c$ proposed in Section 5.


7 Conclusion 


We presented an abstract contract theory for procedural languages, based on de- 
notational semantics. The theory is shown to be an instance of the meta-theory 
of [5], and at the same time an abstraction of the standard denotational seman- 
tics of procedural languages. We believe that our contract theory can be used to 
support the development of cyber-physical and embedded systems by the design 
methodology supported by the meta-theory, allowing the individual procedures 
of the embedded software to be treated as any other system component. The 
work also strengthens the claims of the meta-theory of distilling the notion of 
contracts to its essence, by showing that it is applicable also in the context 
of procedural programs and deductive verification. Finally, this work serves as 
a preparation for combining our contract theory for procedural programs with 
other instantiations of the meta-theory. In future work we plan to investigate 
the utility of our contract theory on real embedded systems taken from the au- 
tomotive industry, where not all components are procedural programs, or even 
software (cf. our previous work, e.g., [11]). We also plan to extend our toy im- 
perative language with additional features, such as procedure parameters and 
return values. Furthermore, we plan to extend the contract theory to capture 
program traces by developing a finite-trace semantics, to enable its use in the 
specification and verification of temporal properties. Lastly, we plan to combine 
our contract theory with an existing contract theory for hybrid systems [20]. 


Abstract. Systematic testing of autonomous vehicles operating in com-
plex real-world scenarios is a difficult and expensive problem. We present 
PARACOSM, a framework for writing systematic test scenarios for au- 
tonomous driving simulations. PARACOSM allows users to programmati- 
cally describe complex driving situations with specific features, e.g., road 
layouts and environmental conditions, as well as reactive temporal behaviors
of other cars and pedestrians. PARACOSM enables a systematic exploration of the
state space, both for visual features and for reactive interactions with
the environment. We define a notion of test coverage
for parameter configurations based on combinatorial testing and low dis- 
persion sequences. Using fuzzing on parameter configurations, our auto- 
matic test generator can maximize coverage of various behaviors and find 
problematic cases. Through empirical evaluations, we demonstrate the 
capabilities of PARACOSM in programmatically modeling parameterized 
test environments, and in finding problematic scenarios. 


Keywords: Autonomous driving · Testing · Reactive programming.


1 Introduction 


Building autonomous driving systems requires complex and intricate engineering 
effort. At the same time, ensuring their reliability and safety is an extremely 
difficult task. There are serious public safety and trust concerns [63], aggravated 
by recent accidents involving autonomous cars [48]. Software in such vehicles
combines well-defined tasks such as trajectory planning, steering, acceleration
and braking, with underspecified tasks such as building a semantic model of 
the environment from raw sensor data and making decisions using this model. 
Unfortunately, these underspecified tasks are critical to the safe operation of 
autonomous vehicles. Therefore, testing in large varieties of realistic scenarios is 
the only way to build confidence in the correctness of the overall system. 
Running real tests is a necessary, but slow and costly process. It is diffi- 
cult to reproduce corner cases due to infrastructure and safety issues; one can 
neither run over pedestrians to demonstrate a failing test case, nor wait for 
specific weather and road conditions. Therefore, the automotive industry tests 


[Figure 1 diagram: a Paracosm program composed of a Test Vehicle (visual model, physical model, controller), a Pedestrian (visual model, physical model, behavior), a Road Segment, the World, and a Collision Monitor, together with a Test Input Generator.]


Fig. 1: A PARACOSM program consists of parameterized reactive components 
such as the test vehicle, the environment, road networks, other actors and their 
behaviors, and monitors. The test input generation scheme guarantees good 
coverage over the parameter space. The test scenario depicted here shows a test 
vehicle stopping for a jaywalking pedestrian. 


autonomous systems in virtual simulation environments [21, 26, 53, 61, 68, 72]. 
Simulation reduces the cost per test, and more importantly, gives precise control 
over all aspects of the environment, so as to test corner cases. 

A major limitation of current tools is the lack of customizability: they either 
provide a GUI-based interface to design an environment piece-by-piece, or focus 
on bespoke pre-made environments. This makes the setup of varied scenarios 
difficult and time-consuming. Though exploiting parametricity in simulation is
useful and effective [10,23,31,67], the cost of environment setup, and navigating 
large parameter spaces, is quite high [31]. Prior works have used bespoke en- 
vironments with limited parametricity. More recently, programmatic interfaces 
have been proposed [27] to make such test procedures more systematic. However, 
the simulated environments are largely still fixed, with no dynamic behavior. 

In this work, we present PARACOSM, a programmatic interface that enables 
the design of parameterized environments and test cases. Test parameters control 
the environment and the behaviors of the actors involved. PARACOSM supports 
various test input generation strategies, and we provide a notion of coverage for 
these. Rather than computing coverage over intrinsic properties of the system 
under test (which is not yet understood for neural networks [39]), our coverage 
criterion is defined over the space of test parameters. Figure 1 depicts the various parts
of a PARACOSM test. A PARACOSM program represents a family of tests, where 
each instantiation of the program’s parameters is a concrete test case. 

PARACOSM is based on a synchronous reactive programming model [13, 35, 
40,70]. Components, such as road segments or cars, receive streams of inputs and 
produce streams of outputs over time. In addition, components have graphical 
assets to describe their appearance for an underlying visual rendering engine and 
physical properties for an underlying physics simulator. For example, a vehicle 
in PARACOSM not only has code that reads in sensor feeds and outputs steering 
angle or braking, but also has a textured mesh representing its shape, position 
and orientation in 3D space, and a physics model for its dynamical behavior. A
PARACOSM configuration consists of a composition of several components. Us- 
ing a set of system-defined components (road segments, cars, pedestrians, etc.) 
combined using expressive operations from the underlying reactive programming 
model, users can set up complex temporally varying driving scenarios. For ex- 
ample, one can build an urban road network with intersections, pedestrians and 
vehicular traffic, and parameterize both environment conditions (lighting, fog)
and behaviors (when a pedestrian crosses a street). 

Streams in the world description can be left “open” and, during testing, 
PARACOSM automatically generates sequences of values for these streams. We use 
a coverage strategy based on k-wise combinatorial coverage [14,38] for discrete 
variables and dispersion for continuous variables. Intuitively, k-wise coverage 
ensures that, for a programmer-specified parameter k, all possible combinations 
of values of any k discrete parameters are covered by tests. Low dispersion [57] 
ensures that there are no “large empty holes” left in the continuous parameter 
space. PARACOSM uses an automatic test generation strategy that offers high 
coverage based on random sampling over discrete parameters and deterministic 
quasi-Monte Carlo methods for continuous parameters [49,57]. 

Like many of the projects referenced before, our implementation performs 
simulations inside a game engine. However, PARACOSM configurations can also 
be output to the OPENDRIVE format [7] for use with other simulators, which is 
more in line with the current industry standard. We demonstrate through various
case studies how PARACOSM can be an effective testing framework for both 
qualitative properties (crash) and quantitative properties (distance maintained 
while following a car, or image misclassification). 

Our main contributions are the following: (I) We present an expressive
framework for programmatically modeling complex, parameterized scenarios
to test autonomous driving systems. Using PARACOSM one can
specify the environment’s layout, behaviors of actors, and expose parameters 
to a systematic testing infrastructure. (II) We define a notion of test coverage 
based on combinatorial k-wise coverage in discrete space and low dispersion in 
continuous space. We show a test generation strategy based on fuzzing that the- 
oretically guarantees good coverage. (III) We demonstrate empirically that our 
system is able to express complex scenarios and automatically test autonomous 
driving agents and find incorrect behaviors or degraded performance. 


2 Paracosm through Examples 


We now provide a walkthrough of PARACOSM through a testing example. Sup- 
pose we have an autonomous vehicle to test. Its implementation is wrapped into 
a parameterized class: 


AutonomousVehicle(start, model, controller) {
  void run(...) { ... } }


where the model ranges over possible car models (appearance, physics), and the 
controller implements an autonomous controller. The goal is to test this class in 
many different driving scenarios, including different road networks, weather and
light conditions, and other car and pedestrian traffic. We show how PARACOSM
enables writing such tests as well as generating test inputs automatically.

A test configuration consists of a composition of reactive objects. The follow- 
ing is an outline of a test configuration in PARACOSM, in which the autonomous 
vehicle drives on a road with a pedestrian wanting to cross. We have simplified 
the API syntax for the sake of clarity and omit the enclosing Test class. In the 
code segments, we use ‘:’ for named arguments. 


// Test parameters
light = VarInterval(0.2, 1.0)   // value in [0.2, 1.0]
nlanes = VarEnum({2,4,6})       // value is 2, 4 or 6
// Description of environment
w = World(light:light, fog:0)
// Create a road segment
r = StraightRoadSegment(len:100, nlanes:nlanes)
// The autonomous vehicle controlled by the SUT
v = AutonomousVehicle(start:..., model:..., controller:...)
// Some other actor(s)
p = Pedestrian(start:..., model:..., ...)
// Monitor to check some property
c = CollisionMonitor(v)
// Place elements in the world
run_test(env: {w, r, v, p}, test_params: {light, nlanes},
         monitors: {c}, iterations: 100)


An instantiation of the reactive objects in the test configuration gives a scene— 
all the visual elements present in the simulated world. A test case provides 
concrete inputs to each “open” input stream in a scene. A test case determines 
how the scene evolves over time: how the cars and pedestrians move and how 
environment conditions change. We go through each part of the test configuration 
in detail below. 


Reactive Objects. The core abstraction of PARACOSM is a reactive object. Reac- 
tive objects capture geometric and graphical features of a physical object, as well 
as their behavior over time. The behavioral interface for each reactive object has 
a set of input streams and a set of output streams. The evolution of the world is 
computed in steps of fixed duration, which correspond to events in a predefined
tick stream. For streams that correspond to physical quantities updated by the
physics simulator, such as the positions and speeds of cars, appropriate events
are generated by the underlying physics simulator. 

Input streams provide input values from the environment over time; output 
streams represent output values computed by the object. The object’s constructor
sets up the internal state of the object. An object is updated by
event-triggered computations. PARACOSM provides a set of assets as base classes.
Autonomous driving systems naturally fit reactive programming models. They 
consume sensor input streams and produce actuator streams for the vehicle 
model. We differentiate between static environment reactive objects (subclassing 


176 R. Majumdar et al. 


[Figure 2 diagram: marble diagram of the input streams nlanes and light and the output stream camera.]


Fig. 2: Reactive streams represented by a marble diagram. A change in the value 
of test parameters nlanes or light changes the environment, and triggers a 
change in the corresponding sensor (output) stream camera. 


Geometric) and dynamic actor reactive objects (subclassing Physical). Environ- 
ment reactive objects represent “static” components of the world, such as road 
segments, intersections, buildings or trees, and a special component called the 
world. Actor reactive objects represent components with “dynamic” behavior: 
vehicles or pedestrians. The world object is used to model features of the world 
such as lighting or weather conditions. Reactive objects can be composed to gen- 
erate complex assemblies from simple objects. The composition process can be 
used to connect static components structurally—such as two road segments con- 
necting at an intersection. Composition also connects the behavior of an object 
to another by binding output streams to input streams. At run time, the values 
on that input stream of the second object are obtained from the output values of 
the first. Composition must respect geometric properties—the runtime system 
ensures that a composition maintains invariants such as no intersection of geo- 
metric components. We now describe the main features in PARACOSM, centered 
around the test configuration above. 


Test Parameters. Using test variables, we can have general, but constrained 
streams of values passed into objects [59]. Our automatic test generator can 
then pick values for these variables, thereby leading to different test cases (see 
Figure 2). There are two types of parameters: continuous (VarInterval) and dis- 
crete (VarEnum). In the example presented, light (light intensity) is a continuous 
test parameter and nlanes (number of lanes) is discrete. 
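
As a rough illustration of the two parameter kinds, the following Python sketch models them as small classes from which a concrete test case can be drawn. The class names mirror VarInterval and VarEnum from the example, but the `sample` method, the `instantiate` helper, and the use of uniform random sampling are assumptions made for illustration only; they are not PARACOSM's actual API.

```python
import random


class VarInterval:
    """Continuous test parameter with values in [lo, hi] (hypothetical sketch)."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("lo must not exceed hi")
        self.lo, self.hi = lo, hi

    def sample(self, rng):
        # Draw one value uniformly from the interval.
        return rng.uniform(self.lo, self.hi)


class VarEnum:
    """Discrete test parameter with finitely many values (hypothetical sketch)."""

    def __init__(self, values):
        self.values = sorted(values)

    def sample(self, rng):
        # Pick one of the allowed values uniformly at random.
        return rng.choice(self.values)


def instantiate(params, rng):
    """Turn a dict of named parameters into one concrete test case."""
    return {name: p.sample(rng) for name, p in params.items()}


rng = random.Random(0)
params = {"light": VarInterval(0.2, 1.0), "nlanes": VarEnum({2, 4, 6})}
test_case = instantiate(params, rng)
```

Each call to `instantiate` yields one member of the family of tests described by the parameterized configuration.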


World. The World is a pre-defined reactive object in PARACOSM with a visual 
representation responsible for atmospheric conditions like the light intensity, 
direction and color, fog density, etc. The code segment 


w = World(light:light, fog:0) 


parameterizes the world using a test variable for light and sets the fog density 
to a constant (0). 


Road Segments. In our example, StraightRoadSegment was parameterized with 
the number of lanes. In general, PARACOSM provides the ability to build com- 
plex road networks by connecting primitives of individual road segments and 
intersections. (A detailed example is presented in our Technical Report [43].) 

It may seem surprising that we model static scene components such as roads
as reactive objects. This serves two purposes. First, we can treat the number of 
lanes in a road segment as a constant input stream that is set by the test case, 
allowing parameterized test cases. Second, certain features of static objects can 
also change over time. For example, the coefficient of friction on a road segment 
may depend on the weather condition, which can be a function of time. 


Autonomous Vehicles & System Under Test (SUT). AutonomousVehicle, as well 
as other actors, extends the Physical class (which in turn subclasses Geometric). 
This means that these objects have a visual as well as a physical model. The 
visual model is essentially a textured 3D mesh. The physical model contains 
properties such as mass, moments of inertia of separate bodies in the vehicle, 
joints, etc. This is used by the physics simulator to compute the vehicle’s motion 
in response to external forces and control input. In the following code segment, 
we instantiate and place our test vehicle on the road: 


v = AutonomousVehicle(start:r.onLane(i, 0.1), model: 
CarAsset(...), controller:MyController(...)) 


The start parameter “places” the vehicle in the world (in relative coordinates). 
The model parameter provides the implementation of the geometric and physical 
model of the vehicle. The controller parameter implements the autonomous 
controller under test. The internals of the controller implementation are not 
important; what is important is its interface (sensor inputs and the actuator 
outputs). These determine the input and output streams that are passed to the 
controller during simulation. For example, a typical controller can take sensor 
streams such as image streams from a camera as input and produce throttle and 
steering angles as outputs. The PARACOSM framework “wires” these streams 
appropriately. For example, the rendering engine determines the camera images 
based on the geometry of the scene and the position of the camera and the 
controller outputs are fed to the physics engine to determine the updated scene. 
Though simpler systems like OPENPILOT [15] use only a dashboard-mounted 
camera, autonomous vehicles can, in general, mix cameras at various mount 
points, LiDARs, radars, and GPS. PARACOSM can emulate many common types 
of sensors which produce streams of data. It is also possible to integrate new 
sensors, which are not supported out-of-the-box, by implementing them using 
the game engine’s API. 


Other Actors. A test often involves many actors such as pedestrians, and other 
(non-test) vehicles. Apart from the standard geometric (optionally physical) 
properties, these can also have some pre-programmed behavior. Behaviors can 
either be only dependent on the starting position (say, a car driving straight 
on the same lane), or be dynamic and reactive, depending on test parameters 
and behaviors of other actors. In general, the reactive nature of objects enables 
complex scenarios to be built. For example, here we specify a simple behavior of
a pedestrian crossing a road. The pedestrian starts crossing the road when a car
is a certain distance away. In the code segments below, we use ‘_’ as shorthand
for a lambda expression, i.e., “f(_)” is the same as “x => f(x)”.


Pedestrian(value start, value target, carPos, value dist,
           value speed) extends Geometric {
  // Initialization
  // Generate an event when the car gets close
  trigger = carPos.Filter( abs(_ - start) < dist )
  // target location reached
  done = pos.Filter( _ == target )
  // Walk to the target after trigger fires
  tick.SkipUntil(trigger).TakeUntil(done).foreach( ... /* walk with given speed */ )
}
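
The effect of chaining SkipUntil and TakeUntil can be approximated without a reactive library. The Python sketch below replays a recorded list of car positions, one per tick, and returns the pedestrian's position after each tick. It is a simplification under stated assumptions: positions are one-dimensional numbers, walking begins on the tick where the trigger condition first holds, and the function name `simulate_pedestrian` is hypothetical.

```python
def simulate_pedestrian(car_positions, start, target, dist, speed):
    """Return the pedestrian's 1-D position after each tick (hypothetical sketch).

    The pedestrian waits at `start` until the car is closer than `dist`
    (the SkipUntil part), then walks toward `target` at `speed` per tick
    until the target is reached (the TakeUntil part).
    """
    pos = start
    triggered = False
    trace = []
    for car in car_positions:  # one loop iteration per tick
        if not triggered and abs(car - start) < dist:
            triggered = True  # the trigger event fires
        if triggered and pos != target:
            # Never overshoot the target on the last step.
            step = min(speed, abs(target - pos))
            pos += step if target > pos else -step
        trace.append(pos)
    return trace


trace = simulate_pedestrian(
    car_positions=[50, 40, 30, 20, 10, 0],
    start=0, target=3, dist=25, speed=2,
)
# trace == [0, 0, 0, 2, 3, 3]
```

The pedestrian stands still for the first three ticks, starts walking when the car is 20 units away, and stops once the target is reached.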


Monitors and Test Oracles. PARACOSM offers an API for writing qualitative
and quantitative temporal specifications. For instance, in the following example,
we check that there is no collision and ensure that the collision was not trivially
avoided because our vehicle did not move at all.


// no collision
CollisionMonitor(AutonomousVehicle v) extends Monitor {
  assert(v.collider.IsEmpty()) }
// cannot trivially pass the test by staying put
DistanceMonitor(AutonomousVehicle v, value minD) extends Monitor {
  pOld = v.pos.Take(1).Concat(v.pos)
  D = v.pos.Zip(pOld).Map( abs(_ - _) ).Sum()
  assert(D >= minD)
}


The ability to write monitors which read streams of system-generated events 
provides an expressive framework to write temporal properties, something that 
has been identified as a major limitation of prior tools [31]. Monitors for metric 
and signal temporal logic specifications can be encoded in the usual way [18,33]. 
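
The distance computation in DistanceMonitor can be checked offline on a recorded trace. The following Python sketch assumes the position stream has been collected into a list of one-dimensional positions; the function names are hypothetical and the list-based treatment replaces the stream operators of the original.

```python
def total_distance(positions):
    """Sum of |p_t - p_(t-1)| over a recorded 1-D position trace."""
    # Pair each position with its predecessor; the first position is paired
    # with itself, mirroring pOld = pos.Take(1).Concat(pos).
    previous = positions[:1] + positions
    return sum(abs(p - q) for p, q in zip(positions, previous))


def distance_monitor(positions, min_d):
    """True if the vehicle travelled at least min_d in total."""
    return total_distance(positions) >= min_d
```

For the trace `[0, 1, 3, 3, 6]` the total distance is 0 + 1 + 2 + 0 + 3 = 6, so the monitor passes for any `min_d` up to 6, while a vehicle that never moves fails for every positive `min_d`.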


3 Systematic Testing of Paracosm Worlds 


3.1 Test Inputs and Coverage 


Worlds in PARACOSM directly describe a parameterized family of tests. The 
testing framework allows users to specify various strategies to generate input 
streams for both static and dynamic reactive objects in the world.


Test Cases. A test of duration T executes a configuration of reactive objects 
by providing inputs to every open input stream in the configuration for T ticks. 
The inputs for each stream must satisfy const parameters and respect the range 
constraints from VarInterval and VarEnum. The runtime system manages the 
scheduling of inputs and pushing input streams to the reactive objects. Let $In$
denote the set of all input streams, and let $In = In_D \cup In_C$ denote the partition of $In$
into discrete streams and continuous streams respectively. Discrete streams take
their value over a finite, discrete range; for example, the color of a car, the number
of lanes on a road segment, or the position of the next pedestrian (left/right) are
discrete streams. Continuous streams take their values in a continuous (bounded)
interval. For example, the fog density and the speed of a vehicle are
continuous streams.


Coverage. In the setting of autonomous vehicle testing, one often wants to 
explore the state space of a parameterized world to check “how well” an au- 
tonomous vehicle works under various situations, both qualitatively and quan- 
titatively. Thus, we now introduce a notion of coverage. Instead of structural 
coverage criteria such as line or branch coverage, our goal is to cover the pa- 
rameter space. In the following, for simplicity of notation, we assume that all 
discrete streams take values from {0,1}, and all continuous streams take values 
in the real interval [0,1]. Any input stream over bounded intervals—discrete or 
continuous—can be encoded into such streams. For discrete streams, there are 
finitely many tests, since each co-ordinate is Boolean and there is a fixed num- 
ber of co-ordinates. One can define the coverage as the fraction of the number 
of vectors tested to the total number of vectors. Unfortunately, the total number
of vectors is very high: if each stream is constant, then there are already
$2^n$ tests for $n$ streams. Instead, we consider the notion of k-wise testing from
combinatorial testing [38]. In k-wise testing, we fix a parameter $k$, and ask that
every interaction between every $k$ elements is tested. Let us be more precise.
Suppose that a test vector has $N$ co-ordinates, where each co-ordinate can get
the value 0 or 1. A set of tests $A$ is a k-wise covering family if for every subset
$\{i_1, i_2, \ldots, i_k\} \subseteq \{1, \ldots, N\}$ of co-ordinates and every vector $v \in \{0,1\}^k$, there
is a test $t \in A$ whose restriction to the co-ordinates $i_1, \ldots, i_k$ is precisely $v$.
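
The definition translates directly into a brute-force check. The Python sketch below is an illustration of the definition only, not the test generator's implementation; the function name `is_k_wise_covering` is hypothetical.

```python
from itertools import combinations


def is_k_wise_covering(tests, k):
    """Check whether equal-length 0/1 tuples form a k-wise covering family."""
    tests = [tuple(t) for t in tests]
    if not tests:
        return False
    n = len(tests[0])
    for coords in combinations(range(n), k):
        # Collect every pattern the tests exhibit on this choice of co-ordinates.
        seen = {tuple(t[i] for i in coords) for t in tests}
        if len(seen) < 2 ** k:
            return False  # some 0/1 pattern of length k is never tested
    return True


pairwise = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

The four vectors in `pairwise` cover every pair of values on every pair of the three co-ordinates, so they form a 2-wise covering family, whereas exhaustive testing would need all eight vectors. They are not 3-wise covering.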

For continuous streams, the situation is more complex: since any continuous 
interval has infinitely many points, each corresponding to a different test case, 
we cannot directly define coverage as a ratio (the denominator will be infinite). 
Instead, we define coverage using the notion of dispersion [49,57]. Intuitively, 
dispersion measures the largest empty space left by a set of tests. We assume a 
(continuous) test is a vector in $[0,1]^N$: each entry is picked from the interval $[0,1]$
and there are $N$ co-ordinates. Dispersion over $[0,1]^N$ can be defined relative to
sets of neighborhoods, such as N-dimensional balls or axis-parallel rectangles.
Let us define $\mathcal{B}$ to be the family of N-dimensional axis-parallel rectangles in
$[0,1]^N$; our results also hold for other notions of neighborhoods such as balls
or ellipsoids. For a neighborhood $B \in \mathcal{B}$, let $\mathrm{vol}(B)$ denote the volume of $B$.
Given a set $A \subseteq [0,1]^N$ of tests, we define the dispersion as the largest volume
neighborhood in $\mathcal{B}$ without any test:


$\mathrm{dispersion}(A) = \sup \{ \mathrm{vol}(B) \mid B \in \mathcal{B} \text{ and } A \cap B = \emptyset \}$


A lower dispersion means better coverage. 
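
For intuition, consider the one-dimensional case, where axis-parallel rectangles are simply intervals and the dispersion is the length of the largest gap left by the tests. The Python sketch below covers only this special case (N = 1); the function name is hypothetical, and computing dispersion in higher dimensions requires a search over empty rectangles that is not shown here.

```python
def dispersion_1d(tests):
    """Largest empty interval left in [0, 1] by a collection of test points."""
    points = sorted(tests)
    # Include the interval end points so gaps at the borders are counted.
    boundaries = [0.0] + points + [1.0]
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))
```

Three evenly spaced tests at 0.25, 0.5 and 0.75 give a dispersion of 0.25, while three tests clustered near 0 would leave a gap of almost 1.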

Let us summarize. Suppose that a test vector consists of $N_D$ discrete
co-ordinates and $N_C$ continuous co-ordinates; that is, a test is a vector $(t_D, t_C)$ in
$\{0,1\}^{N_D} \times [0,1]^{N_C}$. We say a set of tests $A$ is $(k,\epsilon)$-covering if

1. for each set of k co-ordinates $\{i_1, \ldots, i_k\} \subseteq \{1, \ldots, N_D\}$ and each vector
$v \in \{0,1\}^k$, there is a test $(t_D, t_C) \in A \subseteq \{0,1\}^{N_D} \times [0,1]^{N_C}$ such that the
restriction of $t_D$ to the co-ordinates $i_1, \ldots, i_k$ is $v$; and


2. for each $(t_D, t_C) \in A$, the set $\{t_C \mid (t_D, t_C) \in A\}$ has dispersion at most $\epsilon$.


3.2 Test Generation 


The goal of our default test generator is to maximize $(k,\epsilon)$-coverage for a
programmer-specified number of test iterations or ticks.


k-Wise Covering Family. One can use explicit construction results from combi- 
natorial testing to generate k-wise covering families [14]. However, a simple way 
to generate such families with high probability is random testing. The proof is by 
the probabilistic method [4] (see also [44]). Let $A$ be a set of $2^k(k \log N - \log \delta)$
uniformly randomly generated $\{0,1\}^N$ vectors. Then $A$ is a k-wise covering family
with probability at least $1 - \delta$.
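
To get a feel for the sample size, the bound can be evaluated numerically. The Python sketch below is an illustration under assumptions: the text does not state the base of the logarithm, so the natural logarithm is assumed here, the result is rounded up to an integer, and both function names are hypothetical.

```python
import math
import random


def num_random_tests(k, n, delta):
    """Evaluate 2^k (k log N - log delta), rounded up (natural log assumed)."""
    return math.ceil(2 ** k * (k * math.log(n) - math.log(delta)))


def random_tests(k, n, delta, rng):
    """Draw that many uniformly random 0/1 vectors with n co-ordinates."""
    count = num_random_tests(k, n, delta)
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(count)]


tests = random_tests(k=3, n=10, delta=0.01, rng=random.Random(0))
```

For k = 3, N = 10 and δ = 0.01 this evaluates to 93 random vectors, compared with the 1024 vectors needed for exhaustive testing of ten Boolean co-ordinates.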


Low Dispersion Sequences. It is tempting to think that uniformly generating 
vectors from $[0,1]^N$ would similarly give low dispersion sequences. Indeed, as
the number of tests goes to infinity, the set of randomly generated tests has 
dispersion 0 almost surely. However, when we fix the number of tests, it is well 
known that uniform random sampling can lead to high dispersion [49,57]; in fact, 
one can show that the dispersion of n uniformly randomly generated tests grows 
asymptotically as $O((\log \log n / n)^{1/2})$ almost surely. Our test generation strategy
is based on deterministic quasi-Monte Carlo sequences, which have much better 
dispersion properties, asymptotically of the order of O(1/n), than the dispersion 
behavior of uniformly random tests. There are many different algorithms for 
generating quasi-Monte Carlo sequences deterministically (see, e.g., [49,57]). We 
use Halton sequences. For a given $\epsilon$, we need to generate $O(1/\epsilon)$ inputs via Halton
sampling. In Section 4.2, we compare uniform random and Halton sampling. 
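
For readers unfamiliar with Halton sequences, the following Python sketch shows the standard construction: the i-th point uses the radical inverse of i in a different prime base for each co-ordinate. The choice of bases 2 and 3 and the function names are assumptions for illustration; the text does not say which bases or implementation PARACOSM uses.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-b digits of index
    around the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton(n_points, bases=(2, 3)):
    """First n_points of the Halton sequence, one prime base per dimension."""
    return [
        tuple(radical_inverse(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]


points = halton(3)
# First co-ordinates: 0.5, 0.25, 0.75; second co-ordinates: 1/3, 2/3, 1/9.
```

Each new point falls into one of the larger remaining gaps, which is why the sequence avoids the clusters and holes that uniform random sampling tends to produce for a fixed number of tests.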


Cost Functions and Local Search. In many situations, testers want to optimize 
parameter values for a specific function. A simple example of this is finding 
higher-speed collisions, which intuitively, can be found in the vicinity of test pa- 
rameters that already result in high-speed collisions. Another, slightly different 
case is (greybox) fuzzing [5,55], for example, finding new collisions using small 
mutations on parameter values that result in the vehicle narrowly avoiding a col- 
lision. Our test generator supports such quantitative objectives and local search. 
A quantitative monitor evaluates a cost function on a run of a test case. Our test 
generation tool generates an initial, randomly chosen, set of test inputs. Then, 
it considers the scores returned by the Monitor on these samples, and performs 
a local search on samples with the highest/lowest scores to find local optima of
the cost function. 
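
The text does not specify which local search is used, so the following Python sketch substitutes a simple random-mutation hill climber to illustrate the idea: start from the best-scoring initial sample and keep any small mutation that improves the score. The function name, the step size, and the maximization convention are all assumptions.

```python
import random


def local_search(cost, seeds, bounds, step=0.05, iterations=50, rng=None):
    """Hill-climb from the best seed using small random mutations.

    cost   -- function mapping a parameter tuple to a score (higher is better)
    seeds  -- initial parameter tuples, e.g. from random or Halton sampling
    bounds -- one (lo, hi) pair per parameter; mutations are clamped to these
    """
    rng = rng or random.Random(0)
    best = max(seeds, key=cost)
    best_score = cost(best)
    for _ in range(iterations):
        candidate = tuple(
            min(hi, max(lo, x + rng.uniform(-step, step)))
            for x, (lo, hi) in zip(best, bounds)
        )
        score = cost(candidate)
        if score > best_score:  # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score


def toy_cost(params):
    # Stand-in for a quantitative monitor: peaks when the parameter is 0.7.
    return -(params[0] - 0.7) ** 2


seeds = [(0.1,), (0.5,), (0.9,)]
best, best_score = local_search(toy_cost, seeds, bounds=[(0.0, 1.0)])
```

In a real setting, `cost` would run one simulation and return the monitor's score, for example the collision speed, so each iteration of the loop corresponds to one additional test.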


4 Implementation and Tests


4.1 Runtime System and Implementation 


PARACOSM uses the Unity game engine [69] to render visuals, do runtime checks 
and simulate physics (via PhysX [16]). Reactive objects are built on top of UniRx 
[36], an implementation of the popular Reactive Extensions framework [56]. The 
game engine manages geometric transformations of 3D objects and offers
easy-to-use abstractions for generating realistic simulations. Encoding behaviors and
monitors, management of 3D geometry and dynamic checks are implemented
using the game engine interface. The project code is available at:
https://gitlab.mpi-sws.org/mathur/paracosm.

A simulation in PARACOSM proceeds as follows. A test configuration is specified
as a subclass of the EnvironmentProgramBaseClass. Tests are run by invoking
the run_test method, which receives as input the reactive objects that should 
be instantiated in the world as well as additional parameters relating to the test. 
The run_test method runs the tests by first initializing and placing the reactive 
objects in the scene using their 3D mesh (if they have one) and then invoking a 
reactive engine to start the simulation. The system under test is run in a sepa- 
rate process and connects to the simulation. The simulation then proceeds until 
the simulation completion criterion is met (a time-out or some monitor event).


Output to Standardized Testing Formats. There have been recent efforts to cre- 
ate standardized descriptions of tests in the automotive industry. The most 
relevant formats are OPENDRIVE [7] and OPENSCENARIO (only recently 
finalized) [8]. OPENDRIVE describes road structures, and OPENSCENARIO 
describes actors and their behavior. PARACOSM currently supports outputs to 
OPENDRIVE. Due to the static nature of the specification format, a different 
file is generated for each test iteration/configuration. 


4.2 Evaluation 


We evaluate PARACOSM with respect to the following research questions (RQs): 
RQ 1: Does PARACOSM’s programmatic interface enable the easy design of test 
environments and worlds? 

RQ 2: Do the test input generation strategies discussed in Section 3 effectively 
explore the parameter space? 

RQ 3: Can PARACOSM help uncover poor performance or bad behavior of the 
SUT in common autonomous driving tasks? 


Methodology. To answer RQ 1, we develop three independent environments rich 
with visual features and other actors, and use the variety generated with just a 
few lines of code as a proxy for ease of design. To answer RQ 2, we use
coverage-maximizing strategies for test inputs in all three environments/case studies.
We also use and evaluate cost functions and local-search-based methods. To
answer RQ 3, we test various neural network based systems and demonstrate 




Table 1: An overview of our case studies. Note that even though the Adaptive 
Cruise Control study has 2 discrete parameters, we calculate k-wise coverage for 
3 as the 2 parameters require 3 bits for representation. 


             Road segmentation            Jaywalking pedestrian             Adaptive Cruise Control
SUT          VGGNet CNN [62]              NVIDIA CNN [12]                   NVIDIA CNN [12]
Training     191 images                   403 image & car control samples   1034 image & car control samples
Test params  3 discrete                   2 continuous                      3 continuous & 2 discrete
Test iters   100                          100, 15s timeout                  100, 15s timeout
Monitor      Ground truth                 Scored Collision                  Collision & Distance
Coverage     k = 3 with probability ~ 1   ε = 0.041                         ε = 0.043, k = 3 with probability ~ 1


(a) A good test with all parameter values same as the training set (true positive: 89%, false positive: 0%).
(b) A bad test with all parameter values different from the training set (true positive: 9%, false positive: 1%).


Fig. 3: Example results from the road segmentation case study. Pixels with a 
green mask are segmented by the SUT as a road. 


how PARACOSM can help uncover problematic scenarios. A summary of the case 
studies presented here is available in Table 1. In our Technical Report [43], we 
present more case studies, specifically experiments on many pre-trained neu- 
ral networks, busy urban environments and studies exploiting specific testing 
features of PARACOSM. 


4.3 Case Studies 


Road segmentation. Using PARACOSM’s programmatic interface, we design a long
road segment with several vehicles. Each vehicle drives in its
respective lane with a fixed maximum velocity. The test parameters are the
number of lanes ({2, 4}), number of cars in the environment ({0, 5}) and light
conditions ({Noon, Evening}). Noon lighting is much brighter than evening lighting,
and the direction of the light is the opposite. We test a deep CNN called VGGNet
[62], which is known to perform well on several image segmentation benchmarks.
The task is road segmentation, i.e., given a camera image, identifying which 
pixels correspond to the road. The network is trained on 191 dashcam images 


Table 2: Summary of results of the road segmentation case study. Each combi-
nation of parameter values is presented separately, with the parameter values 
used for training in bold. We report the SUT’s average true positive rate (% of 
pixels corresponding to the road that are correctly classified) and false positive 
rate (% of pixels that are not road, but incorrectly classified as road). 


# lanes   # cars   Lighting   # test iters   True positive (%)   False positive (%)

2         5        Noon       12             70%                 5.1%
2         5        Evening    14             53.4%               22.4%
2         0        Evening    12             51.4%               18.9%
2         0        Noon       12             71.3%               6%

4         5        Evening    10             60.4%               71%
4         5        Noon       16             68.5%               20.2%
4         0        Evening    13             51.5%               7.1%
4         0        Noon       11             83.3%               21%


Table 3: Results for the jaywalking pedestrian case study.


Testing strategy             Dispersion (ε)   % fail   Max. collision
Random                       0.092            7%       10.5 m/s
Halton                       0.041            10%      11.3 m/s
Random+opt/collision         0.109            13%      11.1 m/s
Halton+opt/collision         0.043            20%      11.9 m/s
Random+opt/almost failing    0.126            13%      10.5 m/s
Halton+opt/almost failing    0.043            13%      11.4 m/s


captured in the test environment with fixed parameters (2 lanes, 5 cars, and
Noon lighting), recorded at a rate of one image every 1/10th of a second while
manually driving the vehicle around (using a keyboard). We test on 100 images
generated using PARACOSM’s default test generation strategy (uniform random
sampling for discrete parameters). Table 2 summarizes the test results. The SUT
performs worse on tests whose parameter values are far from those of the
training set. As depicted in Figure 3, this happens because varying the test
parameters can drastically change the scene.
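
The two rates reported in Table 2 can be computed per image as in the following minimal sketch. It assumes that the predicted and ground-truth segmentations are available as equally sized 2D boolean masks; the function name and the mask representation are illustrative assumptions, not part of PARACOSM.

```python
def segmentation_rates(predicted, ground_truth):
    """Return (true_positive_rate, false_positive_rate) in percent.

    Both arguments are 2D lists of booleans of equal shape (an assumed
    representation), where True marks a pixel labelled as road.
    """
    tp = fp = road = non_road = 0
    for pred_row, truth_row in zip(predicted, ground_truth):
        for pred, truth in zip(pred_row, truth_row):
            if truth:
                road += 1
                tp += pred      # road pixel correctly classified as road
            else:
                non_road += 1
                fp += pred      # non-road pixel incorrectly classified as road
    tpr = 100.0 * tp / road if road else 0.0
    fpr = 100.0 * fp / non_road if non_road else 0.0
    return tpr, fpr
```

Averaging these per-image rates over all test iterations of one parameter combination would give one row of such a table.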


Jaywalking pedestrian. We now test over the environment presented in Section 2. 
The environment consists of a straight road segment and a pedestrian. The 
pedestrian’s behavior is to cross the road at a specific walking speed when the au- 
tonomous vehicle is a specific distance away. The walking speed of the pedestrian 
and the distance of the autonomous vehicle when the pedestrian starts crossing 
the road are test parameters. The SUT is a CNN based on NVIDIA’s behav- 
ioral cloning framework [12]. It takes camera images as input, and produces the 
relevant steering angle or throttle control as output. The SUT is trained on 403 
samples obtained by driving the vehicle manually and recording the camera and 
corresponding control data. The training environment has pedestrians crossing 
the road at various time delays, but always at a fixed walking speed (1 m/s). In
order to evaluate RQ 2 completely, we evaluate the default coverage maximizing 
sampling approach, as well as explore two quantitative objectives: first, maxi- 
mizing the collision speed, and second, finding new failing cases around samples 
that almost fail. For the default approach, the CollisionMonitor as presented 
in Section 2 is used. For the first quantitative objective, this CollisionMonitor’s 
code is prepended with the following calculation: 


    // Score is the speed of the car at the time of collision
    coll_speed = v.speed.CombineLatest(v.collider, (s, c) => s)
                        .First()


The score coll_speed is used by the test generator for optimization. For the sec- 
ond quantitative objective, the CollisionMonitor is modified to give high scores 
to tests where the distance between the autonomous vehicle and pedestrian is 
very small: 


    CollisionMonitor(AutonomousVehicle v, Pedestrian p)
        extends Monitor {
      minDist = v.pos.Zip(p.pos).Map(1/abs(_-_)).Min()
      coll_score = v.collider.Map(0)
      // Score is either 0 (collision) or 1/minDist
      score = coll_score.DefaultIfEmpty(minDist)
      assert(v.collider.IsEmpty())
    }
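
For readers unfamiliar with the reactive stream notation, the intent of this score can be sketched in plain Python. The sketch follows the comment in the listing (0 for a collision, otherwise the inverse of the smallest observed distance); the function name, the position format, and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def almost_failing_score(vehicle_positions, pedestrian_positions, collided):
    """Score a finished test run (hypothetical helper, not PARACOSM's API).

    Positions are equally long lists of (x, y) tuples sampled at the same
    time steps. Runs that already collide score 0, so the search is steered
    towards runs that come close to a collision without having one.
    """
    if collided:
        return 0.0
    min_dist = min(
        math.hypot(vx - px, vy - py)
        for (vx, vy), (px, py) in zip(vehicle_positions, pedestrian_positions)
    )
    # A smaller minimum distance yields a higher score.
    return 1.0 / min_dist if min_dist > 0 else float("inf")
```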


We evaluate the following test input generation strategies: (i) Random
sampling, (ii) Halton sampling, and (iii) Random or Halton sampling with local
search for the two quantitative objectives. We run 100 iterations of each
strategy with a 15 second timeout. For random or Halton sampling, we sample 100
times. For the quantitative objectives, we first generate 85 random or Halton
samples, then choose the 5 samples with the top scores, and finally run 3
simulated annealing iterations on each of these 5 configurations (85 + 5 × 3 =
100 iterations in total). Table 3 presents results from the various test input
generation strategies. Halton sampling offers the lowest dispersion (highest
coverage) over the parameter space. This can also be visually confirmed from
the plot of test parameters (Figure 4): the Halton samples leave no big gaps in
the parameter space. Moreover, we find that test strategies optimizing for the
first objective are successful in finding more collisions with higher speeds.
As these techniques perform simulated annealing repetitions on top of already
failing tests, they also find more failing tests overall. Finally, test
strategies using the second objective are also successful in finding more
(newer) failure cases than simple Random or Halton sampling.
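
The budget split described above (85 samples, then 3 local search steps around each of the 5 best) can be sketched as follows. This is a minimal illustration under stated assumptions, not PARACOSM's implementation: the `score` callback stands in for running a simulation, and the Gaussian perturbation, cooling schedule, and all function names are hypothetical choices.

```python
import math
import random


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton(n, bounds, bases=(2, 3)):
    """First n points of the Halton sequence, scaled to the given bounds."""
    return [
        tuple(lo + (hi - lo) * radical_inverse(i, b)
              for (lo, hi), b in zip(bounds, bases))
        for i in range(1, n + 1)
    ]


def generate_tests(score, bounds, n_samples=85, top_k=5, sa_iters=3, seed=0):
    """Return all (score, parameters) pairs evaluated within the budget."""
    rng = random.Random(seed)
    evaluated = [(score(p), p) for p in halton(n_samples, bounds)]
    best = sorted(evaluated, key=lambda sp: sp[0], reverse=True)[:top_k]
    for current_score, current in best:
        temperature = 1.0
        for _ in range(sa_iters):
            # Perturb the current point and clamp it to the parameter bounds.
            candidate = tuple(
                min(hi, max(lo, x + rng.gauss(0.0, 0.05 * (hi - lo))))
                for x, (lo, hi) in zip(current, bounds)
            )
            candidate_score = score(candidate)
            evaluated.append((candidate_score, candidate))
            delta = candidate_score - current_score
            # Accept improvements always, and worse moves with a probability
            # that shrinks as the temperature decreases.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_score = candidate, candidate_score
            temperature *= 0.5
    return evaluated
```

With the parameter ranges of Figure 4, `generate_tests(score, [(2.0, 10.0), (30.0, 60.0)])` evaluates exactly 100 parameter settings.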


Adaptive Cruise Control. We now create an environment in which our test
vehicle follows another car (the lead car) in the same lane, and test in it.
The lead car is programmed to drive in the same lane as the test vehicle, with
a certain maximum speed. This is a typical driving scenario on which engineers
test their implementations. We use 5 test parameters: the initial lead of the lead car to
[Figure 4 panels: (a) Random sampling (no opt.); (b) Random + opt. / maximizing
collision; (c) Random + opt. / almost failing; (d) Halton sampling (no opt.);
(e) Halton + opt. / maximizing collision; (f) Halton + opt. / almost failing.]


Fig. 4: A comparison of the various test generation strategies for the jaywalking 
pedestrian case study. The X-axis is the walking speed of the pedestrian (2 to 
10 m/s). The Y-axis is the distance from the car when the pedestrian starts 
crossing (30 to 60 m). Passing tests are labelled with a green dot. Failing tests 
(tests with a collision) are marked with a red cross. 


the test vehicle ([8m, 40m]), the lead car’s maximum speed ([3m/s, 8m/s]), the
density of fog³ in the environment ([0, 1]), the number of lanes on the road
({2, 4}), and the color of the lead car ({Black, Red, Yellow, Blue}). We use both
CollisionMonitor⁴ and DistanceMonitor, as presented in Section 2. A test passes
if there is no collision and the autonomous vehicle moves at least 5 m during
the simulation duration (15 s).
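
The pass criterion combines the two monitors and can be sketched as a small classification function. The function and label names are hypothetical; only the 5 m threshold and the two failure kinds (collision, inactivity) are taken from the text.

```python
def acc_verdict(collided, distance_moved_m, min_distance_m=5.0):
    """Classify one adaptive cruise control test run (illustrative sketch)."""
    if collided:
        return "collision"
    if distance_moved_m < min_distance_m:
        # The vehicle did not move at least 5 m within the simulation time.
        return "inactivity"
    return "pass"
```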

We use PARACOSM’s default test generation strategy, i.e., Halton sampling 
for continuous parameters and Random sampling for discrete parameters (no 
optimization or fuzzing). The SUT is the same CNN as in the previous case 
study. It is trained on 1034 training samples, which are obtained by manually 
driving behind a red lead car on the same lane of a 2-lane road with the same 
maximum velocity (5.5 m/s) and no fog. 

The results of this case study are presented in Table 4. Looking at the dis- 
crete parameters, the number of lanes does not seem to contribute towards a risk 
of collision. Surprisingly, though the training only involves a Red lead car, the 
results appear to be the best for a Blue lead car. Moving on to the continuous 


³ 0 denotes no fog and 1 denotes very dense fog (exponential squared scale).
⁴ The monitor additionally calculates the mean distance of the test vehicle to
the lead car during the test, which is used for later analysis.


186 R. Majumdar et al. 


(a) Initial offset (X-axis) vs. max. speed (Y-axis).
(b) Initial offset (X-axis) vs. fog density (Y-axis).
(c) Max. speed (X-axis) vs. fog density (Y-axis).


Fig. 5: Continuous test parameters of the Adaptive Cruise Control study plotted 
against each other: the initial offset of the lead car (8 to 40 m), the lead car’s 
maximum speed (3 to 8 m/s) and the fog density (0 to 1). Green dots, red crosses, 
and blue triangles denote passing tests, collisions, and inactivity respectively. 


Table 4: Parameterized test on Adaptive Cruise Control, separated for each value
of discrete parameters, and low and high values of continuous parameters. A test
passes if there are no collisions and no inactivity (the overall distance moved by
the test vehicle is more than 5 m). The average offset (in m) maintained by the
test vehicle to the lead car (for passing tests) is also presented.


             Discrete parameters                         Continuous parameters
             Num. lanes    Lead car color                Initial offset (m)   Speed (m/s)      Fog density
             2      4      Black  Red    Yellow  Blue    < 24     > 24        < 5.5   > 5.5    < 0.5   > 0.5
Test iters   54     46     24     22     27      27      51       49          52      48       51      49
Collisions   7      7      3      3      6       2       6        8           8       6        12      2
Inactivity   12     4      4      4      6       2       9        7           9       7        1       15
Offset (m)   42.4   43.4   46.5   48.1   39.6    39.1    33.7     52.7        38.4    47.4     36.5    49.8


parameters, the fog density appears to have the most significant impact on test 
failures (collision or vehicle inactivity). In the presence of dense fog, the SUT 
behaves pessimistically and does not accelerate much (thereby causing a failure 
due to inactivity). These are all interesting and useful metrics about the perfor- 
mance of our SUT. Plots of the results projected on to continuous parameters 
are presented in Figure 5. 
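
The observations above can be checked by turning the counts of Table 4 into failure rates per bucket. The sketch below transcribes a subset of the table; the collision count for fog density > 0.5 is taken to be 2, which is an assumption derived from the other groupings, each of which sums to 14 collisions.

```python
# (test iterations, collisions, inactivity) per bucket, transcribed from Table 4.
TABLE_4 = {
    "lanes=2": (54, 7, 12),
    "lanes=4": (46, 7, 4),
    "color=Black": (24, 3, 4),
    "color=Red": (22, 3, 4),
    "color=Yellow": (27, 6, 6),
    "color=Blue": (27, 2, 2),
    "fog<0.5": (51, 12, 1),
    "fog>0.5": (49, 2, 15),  # collisions assumed to be 14 - 12 = 2
}


def failure_rate(bucket):
    """Percentage of tests in the bucket that fail (collision or inactivity)."""
    tests, collisions, inactivity = TABLE_4[bucket]
    return 100.0 * (collisions + inactivity) / tests


for bucket in TABLE_4:
    print(f"{bucket:13s} {failure_rate(bucket):5.1f}% failing")
```

Under this reading, the Blue lead car has the lowest failure rate among the four colors, and dense fog shifts failures from collisions to inactivity.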


4.4 Results and Analysis 


We now summarize the results of our evaluation with respect to our RQs: 

RQ 1: All three case studies involve varied, rich and dynamic environments.
They are representative of tests engineers would typically want to do, and we
parameterize many different aspects of the world and the dynamic behavior of its
components. These designs are at most 70 lines of code. This provides confidence
in PARACOSM’s ability to provide an easy interface for the design of realistic
test environments.

RQ 2: Our default test generation strategies are found to be quite effective at
exploring the parameter space systematically, eliminating large unexplored gaps,
and at the same time, successfully identifying problematic cases in all three
case studies. The jaywalking pedestrian study demonstrates that optimization
and local search are possible on top of these strategies, and are quite effective
in finding the relevant scenarios. The adaptive cruise control study tests over 5
parameters, which is more than most related works, and even guarantees good
coverage of this parameter space. We therefore conclude that PARACOSM’s
test input generation methods are useful and effective.
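
Coverage in the sense used here is measured by dispersion. The sketch below approximates the standard notion of dispersion (the largest distance from any point of the space to its nearest sample) on a grid over the unit square. The exact definition used for Table 3 is not restated in this section, so the Euclidean metric, the grid approximation, and the function name are assumptions.

```python
import itertools
import math


def dispersion(points, grid=50):
    """Approximate the dispersion of 2D points in the unit square.

    For every grid location, find the distance to the nearest sample;
    the dispersion estimate is the largest such distance.
    """
    ticks = [i / (grid - 1) for i in range(grid)]
    return max(
        min(math.hypot(x - px, y - py) for px, py in points)
        for x, y in itertools.product(ticks, ticks)
    )
```

A lower value means that no location of the parameter space is far from a tested configuration, which corresponds to the absence of large unexplored gaps.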

RQ 3: The road segmentation case study uses a well-performing neural network
for object segmentation, and we are able to detect degraded performance for
automatically generated test inputs. Whereas this study focuses on static image
classification, the next two, i.e., the jaywalking pedestrian and the adaptive
cruise control studies, uncover poor performance on simulated driving, using a
popular neural network architecture for self-driving cars. Therefore, we can
conclude that PARACOSM can find bugs in different kinds of systems
related to autonomous driving.


4.5 Threats to Validity 


The internal validity of our experiments depends on having implemented our 
system correctly and, more importantly, trained and used the neural networks 
considered in the case studies correctly. For training the networks, we followed 
the available documentation and inspected our examples to ensure that we use 
an appropriate training procedure. We watched some test runs and replays of 
tests we did not understand. Furthermore, our implementation logs events and 
we also capture images, which allow us to check a large number of tests. 

In terms of threats to external validity, the biggest challenge in this project 
has been finding systems that we can easily train and test in complex driving 
scenarios. Publicly available systems have limited capabilities and tend to be 
brittle. Many networks trained on real world data do not work well in simulation. 
We therefore re-train these networks in simulation. An alternative is to run 
fewer tests, but use more expensive and visually realistic simulations. Our test 
generation strategy maximizes coverage, even when only a few test iterations 
can be performed due to high simulation cost. 


5 Related Work 


Traditionally, test-driven software development paradigms [9] have advocated 
testing and mocking frameworks to test software early and often. Mocking frame- 
works and mock objects [42,47] allow programmers to test a piece of code against 
an API specification. Typically, mock objects are stubs providing outputs to ex- 
plicitly provided lists of inputs of simple types, with little functionality of the 
actual code. Thus, they fall short of providing a rich environment for autonomous 
driving. PARACOSM can be seen as a mocking framework for reactive, physical 
systems embedded in the 3D world. Our notion of constraining streams is in- 
spired by work on declarative mocking [59]. 
Testing Cyber-Physical Systems. There is a large body of work on automated
test generation tools for cyber-physical systems through heuristic search of a 
high-dimensional continuous state space. While much of this work has focused 
on low-level controller interfaces [6,17,19,20,25,60] rather than the system level, 
specification and test generation techniques arising from this work—for exam- 
ple, the use of metric and signal temporal logics or search heuristics—can be 
adapted to our setting. More recently, test generation tools have started target- 
ing autonomous systems under a simulation-based semantic testing framework 
similar to ours. In most of these works, visual scenarios are either fixed by 
hand [1, 2, 10, 22,27, 29,66, 67], or are constrained due to the model or coverage 
criteria [3, 45,50]. These analyses are shown to be preferable to the application 
of random noise on the input vector. Additionally, a simulation-based approach 
filters benign misclassifications from misclassifications that actually lead to bad 
or dangerous behavior. Our work extends this line of work and provides an ex- 
pressive language to design parameterized environments and tests. ASFAULT [29] 
uses random search and mutation for procedural generation of road networks for 
testing. AC3R [28] reconstructs test cases from accident reports.

To address problems of high time and infrastructure cost of testing au- 
tonomous systems, several simulators have been developed. The most popular 
is Gazebo [26] for the ROS [54] robotics framework. It offers a modular and
extensible architecture, but falls behind on visual realism and the complexity
of environments that can be generated with it. To counter this, game engines are
used. Popular examples are TORCS [72], CARLA [21], and AirSim [61]. Modern
game engines support creation of realistic urban environments. Though they
enable visually realistic simulations, and enable detection of infractions such as 
collisions, the environments themselves are difficult to design. Designing a cus- 
tom environment involves manual placement of road segments, buildings, and 
actors (as well as their properties). Performing many systematic tests is
therefore time-consuming and difficult. While these systems and PARACOSM share
the same aims and much of the same infrastructure, PARACOSM focuses on
procedural design and systematic testing, backed by a relevant coverage criterion.


Adversarial Testing. Adversarial examples for neural networks [32,64] introduce 
perturbations to inputs that cause a classifier to classify “perceptually identical” 
inputs differently. Much work has focused on finding adversarial examples in the 
context of autonomous driving as well as on training a network to be robust to 
perturbations [11,30,46,51,71]. Tools such as DEEPXPLORE [52], DEEPTEST [65], 
DEEPGAUGE [41], and SADL [37] define a notion of coverage for neural networks 
based on the number of neurons activated during tests compared against the 
total number of neurons in the network and activation during training. However, 
these techniques focus mostly on individual classification tasks and apply 2D 
transformations on images. In comparison, we consider the closed-loop behavior 
of the system and our parameters directly change the world rather than apply 
transformations post facto. We can observe, over time, that certain vehicles are 
not detected, which is more useful to testers than a single misclassification [31]. 
Furthermore, it is already known that structural coverage criteria may not be an 
effective strategy for finding errors in classification [39]. We use coverage metrics
on the test space, rather than the structure of the neural network. Alternatively,
there are recent techniques to verify controllers implemented as neural networks
through constraint solving or abstract interpretation [24, 30, 34, 58, 71]. While
these tools do not focus on the problem of autonomous driving, their underlying 
techniques can be combined in the test generation phase for PARACOSM. 


6 Future Work and Conclusion 


Deploying autonomous systems like self-driving cars in urban environments raises 
several safety challenges. The complex software stack processes sensor data, 
builds a semantic model of the surrounding world, makes decisions, plans tra- 
jectories, and controls the car. The end-to-end testing of such systems requires 
the creation and simulation of whole worlds, with different tests representing dif- 
ferent world and parameter configurations. PARACOSM tackles these problems 
by (i) enabling procedural construction of diverse scenarios, with precise control 
over elements like road layout, physical and visual properties of objects, and 
behaviors of actors in the system, and (ii) using quasi-random testing to obtain 
good coverage over large parameter spaces. 

In our evaluation, we show that PARACOSM enables easy design of environments
and automated testing of autonomous agents implemented using neural
networks. While finding errors in sensing can be done with only a few static
images, we show that PARACOSM also enables the creation of longer test scenarios
which exercise the controller’s feedback on the environment. Our case studies 
focused on qualitative state space exploration. In future work, we shall perform 
quantitative statistical analysis to understand the sensitivity of autonomous ve- 
hicle behavior on individual parameters. 

In the future, we plan to extend PARACOSM’s testing infrastructure to also aid 
in the training of deep neural networks that require large amounts of high quality 
training data. For instance, we show that small variations in the environment 
result in widely different results for road segmentation. Generating data is a 
time consuming and expensive task. PARACOSM can easily generate labelled 
data for static images. For driving scenarios, we can record a user manually 
driving in a parameterized PARACOSM environment and augment this data by 
varying parameters that should not impact the car’s behavior. For instance, we 
can vary the color of other cars, positions of pedestrians who are not crossing, 
or even the light conditions and sensor properties (within reasonable limits). 
Abstract. The analysis of behavioral models is of high importance for
cyber-physical systems, as the systems often encompass complex behav- 
ior based on e.g. concurrent components with mutual exclusion or prob- 
abilistic failures on demand. The rule-based formalism of probabilistic 
timed graph transformation systems is a suitable choice when the mod- 
els representing states of the system can be understood as graphs and 
timed and probabilistic behavior is important. However, model checking 
PTGTSs is limited to systems with rather small state spaces. 

We present an approach for the analysis of large-scale systems modeled 
as probabilistic timed graph transformation systems by systematically 
decomposing their state spaces into manageable fragments. To obtain 
qualitative and quantitative analysis results for a large-scale system, we 
verify that results obtained for its fragments serve as overapproxima- 
tions for the corresponding results of the large-scale system. Hence, our 
approach allows for the detection of violations of qualitative and quanti- 
tative safety properties for the large-scale system under analysis. We con- 
sider a running example in which we model shuttles driving on tracks 
of a large-scale topology and for which we verify that shuttles never col- 
lide and are unlikely to execute emergency brakes. In our evaluation, we 
apply an implementation of our approach to the running example. 


Keywords: cyber-physical systems, graph transformation systems, qual- 
itative analysis, quantitative analysis, probabilistic timed systems, com- 
positional analysis, model checking 


1 Introduction 


Real-time cyber-physical systems often exhibit complex behavior based on, e.g.,
concurrent components with mutual exclusion or probabilistic failures on
demand. Consequently, modeling formalisms for capturing such systems must
suitably support the modeling of their complex behaviors. In such a
model-driven approach, the analysis of behavioral models w.r.t. a provided
specification is vital to ensure overall soundness of the resulting system.
Fig. 1: Occurrence of a single FT with border and core in an LST (left) and five
occurrences of three FTs in an LST overlapping in their borders (right).


The rule-based transformation of graphs is a suitable choice when the mod- 
els representing states of the system can be understood as graphs. In particular, 
the formalism of probabilistic timed graph transformation systems (PTGTSs) 
extends the standard rule-based transformation of graphs such that timed and 
probabilistic behavior is covered by supporting (a) non-deterministic choice 
among steps, (b) probabilistic choice among step results, and (c) steps repre- 
senting the passage of time. 

A model checking approach for PTGTSs w.r.t. probabilistic metric temporal
properties was introduced in [19]. However, this model checking approach is
also limited to systems with rather small state spaces due to the state space
explosion problem. As a workaround, a selected set of small examples, hopefully
capturing all system-specific challenges, may be considered to establish trust
that the model exhibits the required safe behavior and that unwanted behavior
is sufficiently unlikely. However, it cannot be ruled out that the considered
small examples fail to reveal all the threatening behavior.

We present a decomposition-based approach for the analysis of large-scale 
systems modeled as PTGTSs to rule out violations of qualitative and quantita- 
tive safety properties. 

As a first step, we capture the underlying static large-scale topology (LST
for short) of a large-scale system as a subgraph that is not changed by graph
transformation, describe how a fragment topology (FT for short) can be
embedded into such an LST (see the left part of Figure 1), and specify how
multiple such embeddings of FTs can overlap in their borders (see the right
part of Figure 1).

As a second step, based on the decomposition described by such embed- 
dings, we construct for each FT an adapted PTGTS. Such an adapted PTGTS 
is then ensured to (a) exhibit the same behavior on the non-overlapped part of 
the FT (named core) and to (b) simulate all possible behaviors that can happen 
for any occurrence of the FT in an LST. To obtain the mentioned simulation, 
we include modifications of the rules of the original PTGTS operating on the 
border of an FT into the adapted PTGTS. With this direct relationship between 
behaviors on the FTs and the LST, we obtain that the likelihood of an unwanted 
or forbidden graph pattern in one of the adapted PTGTSs is an upper bound
for its likelihood in its embedding in the large-scale PTGTS. 

As a last step, exploiting our decomposition to counter the state space
explosion problem, we apply the model checking approach from [19] to the
PTGTSs constructed for the FTs, employing its reduction to probabilistic timed
automata (PTA), instead of applying the model checking approach directly to
the PTGTS modeling the large-scale system.


To illustrate our approach, we consider a running example in which we 
model shuttles driving on tracks of an LST and for which we verify that shut- 
tles never collide and are unlikely to execute emergency brakes. In our evalu- 
ation, we apply an implementation of our approach to the running example. 


The idea to decompose a system into subsystems or to compose it from sub- 
systems for the analysis has been studied intensively [25] but our suggested 
compositional approach has distinguishing characteristics. Firstly, the vast ma- 
jority of approaches (like process algebras or similar models) assume that the 
modeling formalism supports the composition/decomposition as a first class 
concept such that compositional analysis techniques are directly applicable as 
the subsystem models cover all possible behaviors in all contexts. In contrast, 
we do not rely on a built-in decomposition operator but rather allow for a 
flexible derivation of an LST decomposition in terms of FTs, overlappings, and
a suitable overapproximation on the border, which are not predefined by the 
modeling formalism. 


Secondly, several approaches rely on a protocol-like specification of how
the decomposed subsystems interact, while in our approach the
overapproximation is derived systematically from the PTGTS model, which does
not necessarily provide such a protocol-like specification already. The
compositional analysis approach for graph transformation systems (GTSs) from
[24, 11] defines explicit interfaces, which are used to check whether the
behaviors of two independent graphs glued via these interfaces (requiring that
local transitions are compatible) jointly cover all global transitions.
Moreover, in further
approaches, protocols for the roles of collaborations and ports of components 
have been assumed. For example, in [14], the idea to overapproximate the 
environment and border is explored for timed automata with explicit mod- 
els of the roles in form of protocol automata. This idea has been combined 
with dynamic collaborations in [12, 13] captured by timed GTSs (TGTSs) and 
their analysis via inductive invariant checking [3, 4]. Later on, this approach 
has been extended to role, component, and collaboration behavior, which is 
captured by TGTSs and hybrid GTSs in [5] and [2], respectively. However, as 
opposed to the presented approach, in all these cases an explicit concept of 
interface is assumed to separate parts that are analyzed in isolation. 


This paper is structured as follows. In section 2, we introduce our running 
example from the domain of cyber-physical systems. In section 3, we recapit- 
ulate the necessary preliminaries related to PTA and PTGTSs also presenting 
the modeling of our running example. In section 4, we discuss the decompo- 
sition of static substructures of large-scale systems. In section 5, we present 
our decomposition-based approach allowing to split the model checking prob- 
lem into more manageable parts. In section 6, we present an evaluation of the 
conceptual results for our running example. Finally, in section 7, we close the 
paper with a conclusion and an outlook on planned future work. 
2 Running Example


We now informally introduce a scenario (based on the RailCab project [23]) of 
autonomous shuttles driving on an LST, which serves as a running example in 
the remainder of this paper. Based on this introduction, we will discuss how 
we model this shuttle scenario as a PTGTS in the next section.

In the considered shuttle scenario, a track topology containing a large num- 
ber of tracks of approximately equal length is given. Tracks are connected to 
the adjacent tracks via directed connections building in this manner track se- 
quences. Two track sequences can be joined together (i.e., can end up in a 
common track with two predecessors) leading to a join fragment topology (see 
FT8 in Figure 4a) or can split up from a common track (i.e., a common track 
has then two successor tracks) leading to a fork fragment topology (see FT7 
in Figure 4a). Moreover, depots may have a directed connection to a track 
allowing shuttles to enter or exit the track topology. Shuttles, which are al- 
ways located on a single track, may be in mode DRIVE, STOP, or BRAKE. 
Being in mode DRIVE, shuttles drive to the next track (respecting the direc- 
tion of the connection between the tracks) with a certain velocity, which may 
be slow ([3,4] time units per track) or fast ([2,3] time units per track). Regu- 
larly, shuttles change into mode STOP, which allows them to avoid coming too 
close to other shuttles. Moreover, shuttles should slow down before entering 
a track with a construction site on it. However, shuttles noticing the construc- 
tion site too late have to execute an emergency brake thereby changing into the 
mode BRAKE. To reduce the likelihood of such emergency brakes, yellow traf- 
fic lights are installed a few tracks ahead of such construction sites to indicate 
to shuttles that they should slow down. After construction sites, green traffic 
lights may be installed permitting shuttles to increase their velocity. However, 
we also consider failures on demand where a traffic light that is passed by a 
shuttle is not recognized or, for some other reason, not appropriately taken 
into account by the shuttle. We assume a failure probability of $10^{-6}$ for
this case, assuming that the failure depends not only on the visual observation
by the train driver but also on a failure of the backup system.
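
To give a feeling for how such a small per-demand failure probability accumulates, the following sketch computes the probability of at least one unrecognized traffic light over a number of passings, taking the per-demand failure probability to be $10^{-6}$. It assumes that failures on different demands are independent, which is a simplifying assumption made only for this illustration; the function name is hypothetical.

```python
import math


def prob_any_failure(p_fail, demands):
    """Probability of at least one failure in `demands` independent demands.

    Computes 1 - (1 - p_fail) ** demands in a numerically stable way.
    """
    return -math.expm1(demands * math.log1p(-p_fail))


# For example, 1000 traffic light passings with p_fail = 1e-6 give a value
# just below the union bound 1000 * 1e-6 = 1e-3.
print(prob_any_failure(1e-6, 1000))
```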

In our running example, static elements are the tracks, depots, installed 
traffic lights, and construction sites as well as connections between these el- 
ements. The PTGTS modeling the behavior of the described scenario never 
changes this underlying LST. Complementarily, dynamic elements are shuttles,
their attributes, their connections to tracks of the LST as well as the attributes
of traffic lights. Note that we later use a grammar to generate admissible LSTs.

For the considered shuttle scenario, we are interested in various properties. 
Firstly, we need to verify that the behavior of the system never gets temporally 
stuck in a state where no steps (discrete steps of e.g. driving shuttles or timed 
steps) are enabled. Secondly, we need to verify whether the rules have been 
constructed in a way ensuring the absence of collisions between shuttles (i.e., 
two shuttles should not be on a common track). Thirdly, emergency brakes 
should be improbable at a local level for a single shuttle but also at the global
level for the entire LST and its possibly large number of shuttles.
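
The collision property (no two shuttles on a common track) can be stated as a simple check on a single state. The sketch below uses a plain dictionary from shuttle identifiers to track identifiers as a stand-in for the graph representation; this encoding and the function name are assumptions for illustration and are unrelated to the PTGTS model checking procedure itself.

```python
from collections import Counter


def colliding_tracks(shuttle_tracks):
    """Return the sorted list of tracks occupied by two or more shuttles.

    `shuttle_tracks` maps each shuttle id to the id of the track it is on.
    An empty result means that the state is collision-free.
    """
    counts = Counter(shuttle_tracks.values())
    return sorted(track for track, count in counts.items() if count > 1)
```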
[Figure 2: rule diagrams omitted; the surviving panel captions follow.]

(f) The rule SetSlow: a shuttle may successfully decrease its velocity by
setting its time [...] (where only the lower end of the interval is stored in
the graph) with probability $1 - 10^{-6}$ or may fail to decrease its velocity
with probability $10^{-6}$. [...] attribute to [...] ensures that the rule
cannot be applied twice.

(g) The rule ConstructionSiteBrake: a shuttle with high velocity ([2,3] time
units per track, where only the lower end of the interval is stored in the
graph) needs to execute an emergency brake to ensure that the track with a
construction site on it is not entered with too high a velocity.


Fig. 2: Details for our running example, DPO diagram, and PTA example.
[Figure 3: rule diagrams omitted; panel captions follow.]

(a) The rule Drive: a shuttle may drive to the next track, where the
application condition is used to rule out situations in which there is a
construction site on the next track or the considered shuttle comes too close
to another shuttle.

(b) The rule DriveEnterFast: adaptation of the rule Drive for the case that a
new shuttle enters the current fragment topology with a high velocity (the
similar rule for a shuttle with a low velocity has been omitted here for
brevity) from a context track belonging to another fragment topology.

(c) The rule DriveExit1: adaptation of the rule Drive for the case that a
shuttle drives onto the last track of the current fragment topology.

(d) The rule DriveExit2: adaptation of the rule Drive for the case that a
shuttle exits the current fragment topology towards a track belonging to
another fragment topology.


Fig. 3: The rule Drive and the three adapted rules DriveEnterFast, DriveExit1, 
and DriveExit2 for fragment topologies where parts of the application con- 
dition of the rule Drive are omitted due to the overlay specification of the 
running example. 
3 Preliminaries


We now briefly introduce the subsequently required details for graph trans- 
formation systems (GTSs) [10], probabilistic timed automata (PTA) [17], and 
probabilistic timed graph transformation systems (PTGTSs) [18, 19] in our no- 
tation. Along this presentation, we also discuss the modeling details for our 
running example from the previous section. 

We employ type graphs (cf. [10]) such as the type graph TG from Figure 2a 
for our running example. A type graph describes the set of all admissible 
(typed attributed) graphs by mentioning the allowed types of nodes, edges, 
and attributes. We assume typed attributed graphs in which attributes are
specified using a many-sorted first-order attribute logic as proposed in [21]
(the attribute constraint $\bot$ (false) in $TG$ means that the type graph does not
restrict attribute values). This approach to attribution has been used to capture
constraints on attributes in graph conditions in [27] and to describe attribute 
modifications in [22, 28]. 

Graph transformation is then performed by applying a graph transformation
rule (rule for short) $\rho = (\ell : K \hookrightarrow L, r : K \hookrightarrow R)$ consisting of two
monomorphisms (i.e., all components of the morphisms are injective). The rule specifies
that the graph elements in $L - \ell(K)$ are to be deleted, the graph elements in
$K$ are to be preserved, and the graph elements in $R - r(K)$ are to be added
during graph transformation. Such a rule is applied to a graph $G$ for a given
match $m : L \hookrightarrow G$ resulting in a graph $G''$ by constructing the double pushout
(DPO) diagram (see Figure 2c) where the first and the second pushout squares
describe the removal and the addition of graph elements specified in the rule,
respectively. Moreover, a rule may additionally contain an application condition
$\phi$ (denoted by $\rho = (\ell, r, \phi)$) to rule out certain matches, specifying e.g.
graph elements that may not be connected to graph elements matched by $m$.
For further details on the graph transformation approach, we refer to [10]. 
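
The deletion and addition performed by a rule application can be illustrated with a minimal Python sketch. It is not the formal DPO construction from [10]: it assumes plain labeled graphs without attributes or application conditions, assumes that $K$ is a subgraph of both $L$ and $R$ (so that $\ell$ and $r$ are inclusions), and uses the hypothetical names `Graph` and `apply_rule`.

```python
from dataclasses import dataclass

Edge = tuple[str, str, str]  # (source node, edge label, target node)


@dataclass(frozen=True)
class Graph:
    nodes: frozenset[str]
    edges: frozenset[Edge]


def apply_rule(L: Graph, K: Graph, R: Graph, G: Graph, match: dict[str, str]) -> Graph:
    """Apply the rule (K within L, K within R) to G at an injective node match L -> G."""
    if len(set(match.values())) != len(match):
        raise ValueError("the match must be injective")

    def image(edge: Edge, nodes: dict[str, str]) -> Edge:
        return (nodes[edge[0]], edge[1], nodes[edge[2]])

    if any(match[n] not in G.nodes for n in L.nodes) or any(
        image(e, match) not in G.edges for e in L.edges
    ):
        raise ValueError("L is not embedded into G by the match")

    deleted_nodes = {match[n] for n in L.nodes - K.nodes}
    deleted_edges = {image(e, match) for e in L.edges - K.edges}

    # First pushout square: remove the image of L - K (dangling condition checked).
    kept_edges = G.edges - deleted_edges
    if any(s in deleted_nodes or t in deleted_nodes for s, _, t in kept_edges):
        raise ValueError("dangling condition violated")
    result_nodes = set(G.nodes - deleted_nodes)

    # Second pushout square: add fresh copies of the elements in R - K.
    extended = {n: match[n] for n in K.nodes}
    for n in sorted(R.nodes - K.nodes):
        fresh, counter = n, 0
        while fresh in result_nodes:
            counter += 1
            fresh = f"{n}_{counter}"
        extended[n] = fresh
        result_nodes.add(fresh)
    added_edges = {image(e, extended) for e in R.edges - K.edges}
    return Graph(frozenset(result_nodes), frozenset(kept_edges | added_edges))


# Illustrative rule: move the "at" edge of a shuttle to the next track.
tracks = {("t1", "next", "t2")}
L = Graph(frozenset({"s", "t1", "t2"}), frozenset({("s", "at", "t1")} | tracks))
K = Graph(frozenset({"s", "t1", "t2"}), frozenset(tracks))
R = Graph(frozenset({"s", "t1", "t2"}), frozenset({("s", "at", "t2")} | tracks))
G = Graph(
    frozenset({"shuttle", "a", "b", "c"}),
    frozenset({("shuttle", "at", "a"), ("a", "next", "b"), ("b", "next", "c")}),
)
H = apply_rule(L, K, R, G, {"s": "shuttle", "t1": "a", "t2": "b"})
```

In this example, only the `at` edge belongs to $L - \ell(K)$ and $R - r(K)$, so the track structure of `G` is preserved while the shuttle moves from track `a` to track `b`.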

PTA [17] combine the use of clocks to capture real-time phenomena and 
probabilism to approximate/describe the likelihood of outcomes of certain 
steps. A PTA such as the one in Figure 2d consists of (a) a set of locations with a
distinguished initial location such as $\ell_0$, (b) a set of clocks such as $c_0$ (which are
initially set to 0), (c) an assignment of a set of atomic propositions (APs) such as
$\{done\}$ to each location (for subsequent analysis of e.g. reachability properties),
(d) an assignment of constraints on its clocks to each location as invariants such
as $c_0 \le 3$, and (e) a set of probabilistic timed edges each consisting of (e1) a
single source location, (e2) at least one target location, (e3) a clock constraint
such as $c_0 \ge 2$ specifying as a guard when the edge is enabled based on the
current values of the clocks, (e4) for each target location a probability such
as 0.5 that this target is reached (the probabilities for the target locations of
the edge must add up to 1 as a probability distribution is required),
and (e5) for each target location a set of clocks such as $\{c_0\}$ to be reset to 0
when that target location is reached.
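
The components (e1) to (e5) of a probabilistic timed edge can be captured by a small data structure. The following Python sketch is only an illustration under assumptions: the names `Branch`, `ProbEdge`, `enabled`, and `take` are hypothetical, guards are modeled as arbitrary predicates on valuations, and the example edge merely reuses the values quoted above (guard $c_0 \ge 2$, probability 0.5, reset set $\{c_0\}$) with assumed location names.

```python
import math
from dataclasses import dataclass
from typing import Callable

Valuation = dict[str, float]  # clock name -> current clock value


@dataclass(frozen=True)
class Branch:
    probability: float       # (e4) probability of reaching the target
    target: str              # (e2) target location
    resets: frozenset[str]   # (e5) clocks reset to 0 when the target is reached


@dataclass(frozen=True)
class ProbEdge:
    source: str                          # (e1) single source location
    guard: Callable[[Valuation], bool]   # (e3) clock constraint
    branches: tuple[Branch, ...]

    def __post_init__(self) -> None:
        if not self.branches:
            raise ValueError("an edge needs at least one target location")
        if not math.isclose(sum(b.probability for b in self.branches), 1.0):
            raise ValueError("branch probabilities must add up to 1")


def enabled(edge: ProbEdge, location: str, valuation: Valuation) -> bool:
    """An edge is enabled in its source location when its guard holds."""
    return edge.source == location and edge.guard(valuation)


def take(branch: Branch, valuation: Valuation) -> tuple[str, Valuation]:
    """Move to the branch target and reset the clocks of the branch."""
    return branch.target, {
        c: (0.0 if c in branch.resets else x) for c, x in valuation.items()
    }


# Assumed example edge: succeed (reach l2) or retry (stay in l1), each with 0.5.
edge = ProbEdge(
    source="l1",
    guard=lambda v: v["c0"] >= 2,
    branches=(
        Branch(0.5, "l2", frozenset({"c0"})),
        Branch(0.5, "l1", frozenset({"c0"})),
    ),
)
```

Validating the probability distribution in the constructor mirrors the requirement that the probabilities of all target locations of an edge add up to 1.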

States of a PTA are given by pairs $(\ell, v)$ where $\ell$ is a location and $v$ is the
clock valuation mapping each clock of the PTA to a real number. Nondeterminism
arises in PTA since a step for advancing time as well as multiple
discrete steps may be enabled in a single state. The logic PTCTL [17]
then allows specifying properties such as “what is the worst-case probability
that the PTA reaches a location labeled with the AP done within 5 time units”,
which can be analyzed by the PRISM model checker [16]. For the example
PTA from Figure 2d, this worst-case probability is 0.75 since
the nondeterminism of the PTA would be resolved (by a so-called adversary)
such that the PTA first takes a step to $\ell_1$ without letting time pass and then
performs the probabilistic step (up to two times after waiting for not longer
than 2 time units) until it reaches the location $\ell_2$ labeled with the AP done (the
probabilistic step cannot be taken a third time due to the requirement of at
most 5 time units in the quoted property above).
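
The value 0.75 can be retraced with a short calculation. The following Python sketch is not a model checking procedure; it only evaluates a closed form under the assumptions that every attempt of the probabilistic step requires waiting for at least 2 time units, succeeds with probability 0.5, and otherwise returns to the same location with the clock reset. The function name `max_reach_probability` is hypothetical.

```python
import math


def max_reach_probability(success: float, min_wait: float, deadline: float) -> float:
    """Probability that at least one attempt succeeds before the deadline."""
    attempts = math.floor(deadline / min_wait)  # attempts that fit into the deadline
    return 1.0 - (1.0 - success) ** attempts


probability = max_reach_probability(success=0.5, min_wait=2, deadline=5)
print(probability)  # 0.75
```

Two attempts fit into 5 time units, so the target is missed only if both attempts fail, which happens with probability $0.5^2 = 0.25$.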

PTGTSs have been introduced in [18, 19] as a probabilistic real-time ex- 
tension of GTSs. It has been shown that PTGTSs can be translated to PTA 
and, hence, PTGTSs can be understood as a high-level language for PTA as 
discussed below in more detail and can be analyzed using PRISM as well. 

Similarly to PTA, a PTGTS state is given by a pair (G,v) of a graph and a 
clock valuation. The initial state is given by a distinguished initial graph and a 
valuation setting all clocks to 0. In our running example, each attribute of type 
clockDrive of a Track node (cf. Figure 2a) represents one clock. Invariants and 
APs are specified for PTGTSs by means of graph conditions as in Figure 2b and 
Figure 2e, respectively, for our running example. We use the single invariant
$INV_{driving}$ requiring that shuttles in mode DRIVE cannot be on a track longer
than the value of their minDur (minimal duration) attribute plus 1. Moreover,
we consider three APs to specify properties that we want to analyze later on.
The AP $AP_{unexpectedVelocity}$ is used to detect graphs in which a shuttle does not
have an expected velocity of [2,3] or [3,4] time units per track where only the
lower end of the interval is stored in the graph in the minDur attribute. The
AP $AP_{collision}$ is used to detect graphs in which two shuttles are on a common
track to capture their collision. Finally, the AP $AP_{braked}$ is used to detect graphs
in which a shuttle has just executed an emergency brake.

PTGT rules of a PTGTS then correspond to edges of a PTA and contain
(a) a left-hand side graph $L$, (b) an attribute constraint on the clock attributes
contained in $L$ to capture a guard, (c) a natural number describing a priority
where higher numbers denote higher priorities, and (d) a nonempty set of tuples
of the form $(\ell : K \hookrightarrow L, r : K \hookrightarrow R, \phi, C, p)$ where $(\ell, r, \phi)$ is an underlying
GT rule¹ with application condition $\phi$, $C$ is a set of clock attributes contained
in $L$ to be reset, and $p$ is a real-valued probability from $[0,1]$ where the
probabilities of all such tuples must add up to 1. See Figure 2f, Figure 2g, and
Figure 3a for the three PTGT rules SetSlow, ConstructionSiteBrake, and Drive from
our running example where the last two PTGT rules have a unique underlying
GT rule with probability 1 and where the first PTGT rule has a higher priority
as well as two underlying GT rules with probabilities $10^{-6}$ and $1 - 10^{-6}$. For
the PTGT rules ConstructionSiteBrake and Drive, we depict the graphs $L$, $K$, and
$R$ in a single graph (subsequently called LKR-graph) where graph elements to
be removed and to be added are annotated with $\ominus$ and $\oplus$, respectively. In the
PTGT rule SetSlow, no graph elements are removed or added (i.e., the graphs
$L$ and $R$ of the underlying GT rules coincide). Nevertheless, for this PTGT
rule, we depict the two right-hand side morphisms $r_1$ and $r_2$ as they describe
PTGT steps with different attribute modifications and probabilities. Also, the
PTGT rules ConstructionSiteBrake and Drive have application conditions, which
are depicted to the left of the > symbol or above the v symbol. The attribute
preconditions and attribute modifications are given for each PTGT rule in the red
box below the LKR-graph (or are split into multiple red boxes as for the PTGT
rule SetSlow). In these attribute preconditions and attribute modifications,
unprimed (primed) variables denote the values of attributes before (after) GT
rule application. Note that if variables are not changed by the GT rule application,
we denote this using the operator unchanged (see e.g. Figure 2g where
$unchanged(minD_1, tid_1, tid_2)$ denotes that the variables $minD_1$, $tid_1$, and $tid_2$
remain unchanged). Moreover, further information about the PTGT rule (i.e.,
the guard and the priority) but also further information about the probabilistic
choices (i.e., the sets of clocks to be reset and probabilities) are depicted
in gray boxes. Lastly, we also allow annotating a PTGT step in the induced
state space with (a) a name chosen for the probabilistic choice such as success
and failure in Figure 2f and (b) the values of the variables contained in the list
stepLabel (which may contain variables from $L$ and $R$).

¹ The underlying GT rule may not delete or add clock attributes.

When comparing PTA and PTGTSs, we observe that PTA edges are either
enabled for the current valuation or not whereas PTGT rules may be applicable
for many matches at the same time (e.g. allowing the rule Drive to be applied to
one of multiple shuttles). Priorities used in PTGTSs can be encoded in GTSs
(including PTGTSs) by adding the left-hand side graphs of rules with higher
priorities as negative application conditions to all rules with a lower priority.
Similarly, priorities, if integrated into PTA, could be encoded by refining the
guards. However, for our running example, we can exchange the underlying
track topology without effort, while this would require a fundamental adaptation
of the corresponding PTA. Also, as in [19], we observe in Section 6 that
small PTGTSs result in PTA of considerable size and we therefore conclude
that PTGTSs are typically much more concise compared to PTA.
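
The encoding of priorities by negative application conditions can be sketched as follows. This Python fragment is an illustration only: it merely records, for each rule, which rules contribute their left-hand side graph as a negative application condition, and it does not construct the conditions themselves. The function name `priority_nacs` is hypothetical, and the priority value 1 for SetSlow is an assumption (the text only states that SetSlow has a higher priority than the rules with priority 0).

```python
def priority_nacs(rules: dict[str, int]) -> dict[str, list[str]]:
    """Map each rule to the higher-priority rules whose left-hand sides it must forbid."""
    return {
        name: sorted(other for other, p in rules.items() if p > priority)
        for name, priority in rules.items()
    }


# Assumed priorities: only the value 0 is given for ConstructionSiteBrake and Drive.
priorities = {"SetSlow": 1, "ConstructionSiteBrake": 0, "Drive": 0}
nacs = priority_nacs(priorities)
```

In this sketch, the rules Drive and ConstructionSiteBrake would each receive the left-hand side graph of SetSlow as an additional negative application condition, whereas SetSlow itself remains unchanged.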


4 Decomposition of Large-Scale Topologies 


We now present our decomposition-based approach to analyze a PTGTS $S_0$
modeling a large-scale cyber-physical system along the lines of the informal 
presentation from the introduction. For our running example, such a PTGTS 
is given by an initial graph typed over the type graph from Figure 2a that is 
restricted later on in a suitable way, 13 PTGT rules of which we present three in 
Figure 2f, Figure 2g, and Figure 3a (further rules are given in [20, Appendix]), 
the invariant from Figure 2b, and the three APs from Figure 2e. 
(a) FTs for our running example where the red arrows indicate points for topology [...]

(c) Decomposition $M = \{m_1, m_2, m_3\}$ of an LST w.r.t. FT1-FT8.

(d) Correspondence of the graph transformation based steps between the large-scale
system $S_0$ and one of its fragment systems $S_i$, which are preserving the respective
static structure given by $\overline{G}$ and $\overline{F}_i$.

Fig. 4: FTs for our running example, rule Merge, example for topology composition,
and correspondence between steps in the large-scale system and a
fragment system.
As a first step, we identify a substructure of the initial graph of $S_0$ that is
static in the sense that this substructure is preserved and also never extended
throughout all PTGT steps of $S_0$. For large-scale cyber-physical systems such
as our running example, the existence of such a static substructure may be justified
by a logical or spatial distribution. The embedding of a static substructure
$\overline{G}$ in a given graph $G$ is then captured by a monomorphism $\kappa : \overline{G} \hookrightarrow G$
describing how $\overline{G}$ is embedded into $G$. As a special case, such an embedding $\kappa$
can be derived for arbitrary graphs $G$ by a monomorphism $\kappa_{TG} : \overline{TG} \hookrightarrow TG$
describing how the given type graph $TG$ is restricted to a smaller type graph $\overline{TG}$.
That is, $\overline{G}$ then contains only those elements from $G$ that are typed over the
smaller type graph $\overline{TG}$. For our running example, we restrict the type graph
$TG$ from Figure 2a to such a smaller type graph $\overline{TG}$ by removing the Shuttle
node with its attributes, the at edge connected to the Shuttle node, and the
active attributes from the TLYellow and TLGreen nodes. The graphs $\overline{G}$ obtained
from graphs $G$ of $S_0$ using this restriction are then called large-scale topologies
(LSTs) and contain for our running example a track topology with depots, traffic
lights, and construction sites. Note that the fact that such an underlying LST
is indeed preserved and never extended by arbitrary rule applications can be
verified (at least for our running example) by inspecting each rule individually
using the technique of 1-induction [9, 26].
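
The restriction of a graph to the elements typed over the smaller type graph can be sketched in Python as follows. The encoding of graphs as dictionaries and sets, the function name `restrict`, and the concrete node names and attribute values are assumptions made for illustration; only the type names Track, Shuttle, and TLGreen as well as the edge types next and at are taken from the running example.

```python
Node = tuple[str, dict[str, object]]  # (node type, attribute values)
Edge = tuple[str, str, str]           # (source node, edge type, target node)


def restrict(
    nodes: dict[str, Node],
    edges: set[Edge],
    node_types: set[str],
    edge_types: set[str],
    attributes: set[tuple[str, str]],
) -> tuple[dict[str, Node], set[Edge]]:
    """Keep only the elements whose types occur in the smaller type graph."""
    kept_nodes = {
        name: (ntype, {a: x for a, x in attrs.items() if (ntype, a) in attributes})
        for name, (ntype, attrs) in nodes.items()
        if ntype in node_types
    }
    kept_edges = {
        (s, etype, t)
        for (s, etype, t) in edges
        if etype in edge_types and s in kept_nodes and t in kept_nodes
    }
    return kept_nodes, kept_edges


# Hypothetical state graph with one shuttle and one traffic light.
nodes = {
    "t1": ("Track", {"id": 1}),
    "t2": ("Track", {"id": 2}),
    "s1": ("Shuttle", {"mode": "DRIVE", "minDur": 2}),
    "g1": ("TLGreen", {"active": True}),
}
edges = {("t1", "next", "t2"), ("s1", "at", "t1")}

lst_nodes, lst_edges = restrict(
    nodes,
    edges,
    node_types={"Track", "TLGreen"},   # Shuttle is removed from the type graph
    edge_types={"next"},               # the at edge is removed
    attributes={("Track", "id")},      # the active attribute is removed
)
```

The result contains the two tracks and the traffic light without its active attribute, which corresponds to the static part that is shared by all states of the system.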

As a second step, we now introduce the notion of a decomposition of
the LST into a small set of (constrained) fragment topologies (FTs). Such (constrained)
FTs are given by (a) a graph that is typed over the type graph used
for the LST and (b) a graph condition describing constraints on how the graph
of the FT may be embedded into graphs of $S_0$. Moreover, an overlapping specification
$o$ is required to describe how the embeddings $\alpha_i$ of the graphs of two FTs
may overlap in the LST. Such an overlapping specification is given by a set of
spans $(o_1 : O \hookrightarrow \overline{F}_1, o_2 : O \hookrightarrow \overline{F}_2)$ where $O$ is the permitted overlapping graph that
is embedded into the two FTs. A decomposition of an LST (in the following
definition, we simply consider the LST contained in the initial graph $G_0$ of $S_0$)
is then given by embeddings of selected FTs into the LST (cf. Figure 1) such
that the overlapping specification is satisfied (the constraints of the FTs are
checked for $S_0$ later on). In applications, to mitigate the state space explosion
problem for the model checking phase later on, it is advantageous to employ
a low number of small FTs that are strictly constrained and are allowed to
overlap in a manageable number of ways.


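Definition 1 (Decomposition of LST). If

- $S_0$ is a PTGTS with initial state $s_0 = (G_0, v_0)$,
- $\kappa : \overline{G}_0 \hookrightarrow G_0$ is a monomorphism identifying the LST of $S_0$ contained in $G_0$,
- $\mathcal{F}$ is a set of (constrained) FTs of the form $(\overline{F}_i, \phi_i)$,
- $o((\overline{F}_1, \phi_1), (\overline{F}_2, \phi_2)) \subseteq \{(o_1, o_2) \mid o_1 : O \hookrightarrow \overline{F}_1, o_2 : O \hookrightarrow \overline{F}_2\}$ is an overlapping
  specification, which describes how two FTs from $\mathcal{F}$ may overlap,
- $M$ is a list of tuples of the form $(\overline{F}, \phi, \alpha)$ where $(\overline{F}, \phi) \in \mathcal{F}$ and $\alpha : \overline{F} \hookrightarrow \overline{G}_0$,
- the monomorphisms in $M$ respect the overlapping specification $o$, i.e., (see [20] for
  a visualization) for all $(\overline{F}_1, \phi_1, \alpha_1 : \overline{F}_1 \hookrightarrow \overline{G}_0), (\overline{F}_2, \phi_2, \alpha_2 : \overline{F}_2 \hookrightarrow \overline{G}_0) \in M$ there
  is some pair $(o_1 : O \hookrightarrow \overline{F}_1, o_2 : O \hookrightarrow \overline{F}_2) \in o((\overline{F}_1, \phi_1), (\overline{F}_2, \phi_2))$ such that for the
  pushout $(g_1 : \overline{F}_1 \hookrightarrow P, g_2 : \overline{F}_2 \hookrightarrow P)$ of $(o_1, o_2)$ (i.e., the overlapping of $\overline{F}_1$ and $\overline{F}_2$
  w.r.t. $(o_1, o_2)$) there is some $h : P \hookrightarrow \overline{G}_0$ such that $\alpha_1 = h \circ g_1$ and $\alpha_2 = h \circ g_2$,

then $M$ is a decomposition of the LST of $S_0$ w.r.t. $\kappa$, $\mathcal{F}$, and $o$.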
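
The last condition of Definition 1 can be made more tangible with a simplified check on node sets. The following Python sketch is an approximation under assumptions: it considers nodes only (no edges, attributes, or constraints), represents all morphisms as dictionaries, and uses the hypothetical name `respects_overlap`. On this level, a suitable injective $h$ exists exactly when both embeddings agree on the overlapping graph $O$ and share no further nodes.

```python
def respects_overlap(
    o1: dict[str, str],
    o2: dict[str, str],
    alpha1: dict[str, str],
    alpha2: dict[str, str],
) -> bool:
    """Node-level check that two embeddings overlap exactly as described by (o1, o2)."""
    # h with alpha1 = h . g1 and alpha2 = h . g2 requires agreement on the span.
    agrees = all(alpha1[o1[x]] == alpha2[o2[x]] for x in o1)
    # Injectivity of h requires that no further nodes are shared.
    shared = set(alpha1.values()) & set(alpha2.values())
    glued = {alpha1[o1[x]] for x in o1}
    return agrees and shared == glued


# Two hypothetical FTs with five tracks each that overlap in three tracks of the LST.
o1 = {"x1": "a3", "x2": "a4", "x3": "a5"}
o2 = {"x1": "b1", "x2": "b2", "x3": "b3"}
alpha1 = {f"a{k}": f"t{k}" for k in range(1, 6)}      # a1..a5 -> t1..t5
alpha2 = {f"b{k}": f"t{k + 2}" for k in range(1, 6)}  # b1..b5 -> t3..t7
shifted = {f"b{k}": f"t{k + 3}" for k in range(1, 6)}  # b1..b5 -> t4..t8

ok = respects_overlap(o1, o2, alpha1, alpha2)
not_ok = respects_overlap(o1, o2, alpha1, shifted)
```

The second call fails because the shifted embedding maps the overlapping tracks of the second FT to different tracks of the LST than the first FT does.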


To provide a better intuition for this definition, we now present the decompo- 
sition of the LST considered for our running example. 


Example 1 (Decomposition for Running Example). Let $\mathcal{F}$ contain the constrained
FTs (FT$i$, $\phi_i$) for $1 \le i \le 8$ where each FT$i$ is given in Figure 4a (here we use an
abbreviated notation where D, T, Y, G, and CS are the obvious abbreviations
for the node types of the type graph) and where $\phi_i$ states in each case that
shuttles must have a velocity of [2,3] or [3,4] time units per track.²

Let $o((\overline{F}_1, \phi_1), (\overline{F}_2, \phi_2))$ be the overlapping specification stating that overlappings
$(o_1 : O \hookrightarrow \overline{F}_1, o_2 : O \hookrightarrow \overline{F}_2)$ of two FTs are always (for any of the $8 \times 8$
combinations) of the form $O = T_1 \to T_2 \to T_3$ where $T_1$ and $T_3$ are mapped to
a Track node in $\overline{F}_1$ and $\overline{F}_2$ with an entering and an exiting red arrow by $o_1$ and
$o_2$, respectively.

An example of a decomposition of an LST employing the previously mentioned
FTs and overlapping specification is given in Figure 4c where three
FTs are embedded into an LST. To be usable later on, the decomposition
must ensure that all tracks of the LST to which Shuttle nodes may be connected
(e.g. due to Shuttle nodes in the initial graph of $S_0$ or due to connected
Depot nodes from which Shuttle nodes may enter the LST) are covered by
embedding morphisms. In fact, the eight chosen FTs limit the reasoning for our running
example to LSTs that can be decomposed using these FTs.
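
The coverage requirement from Example 1 can be checked mechanically. The following Python sketch assumes that embeddings are given as dictionaries from FT nodes to LST nodes and that the set of relevant tracks is known; the function name `uncovered_tracks` and the track names are hypothetical.

```python
def uncovered_tracks(tracks: set[str], embeddings: list[dict[str, str]]) -> set[str]:
    """Return the tracks of the LST that are not in the image of any embedding."""
    covered: set[str] = set()
    for embedding in embeddings:
        if len(set(embedding.values())) != len(embedding):
            raise ValueError("embeddings must be injective (monomorphisms)")
        covered |= set(embedding.values())
    return tracks - covered


# Hypothetical LST with seven tracks and two FT instances overlapping in t3..t5.
lst_tracks = {f"t{k}" for k in range(1, 8)}
m1 = {f"a{k}": f"t{k}" for k in range(1, 6)}      # covers t1..t5
m2 = {f"b{k}": f"t{k + 2}" for k in range(1, 6)}  # covers t3..t7

missing_with_both = uncovered_tracks(lst_tracks, [m1, m2])
missing_with_one = uncovered_tracks(lst_tracks, [m1])
```

An empty result indicates that every track on which a shuttle may occur is analyzed as part of at least one fragment.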


In general, we consider two use cases: (a) a given PTGTS with underlying
LST is to be analyzed and (b) LSTs are to be constructed based on the selected
and analyzed FTs. Both use cases are supported but require different
handling. For use case (a), the LST must be parsed w.r.t. the given FTs and
overlapping specification to obtain a decomposition of the LST.
Efficient parsing algorithms have been devised for the special case of hyperedge
replacement (HR) grammars (which require that nodes are not deleted)
in [8, 6, 7]. A suitable graph transformation based grammar for our running
example with 25 rules is given in [20, Appendix]. For use case (b), in which
we need to construct some LST, we may employ node-deleting rules. For our
running example, consider the rule Merge from Figure 4b that can be used to
iteratively overlap two FTs starting with a disjoint union of copies of FTs. The
rule Merge overlaps two instances of three successive Track nodes following
the overlapping specification where the application condition ensures that the
rule is applied at entry and exit points, also excluding the possibility that the
six matched Track nodes belong to an instance of FT$i$ using $\neg\phi_{FTi}$.


² For each FT from Figure 4a, this constraint can be formalized as a graph condition.
5 Overapproximation of Behavior


The decompositions of LSTs introduced in the previous section are now used
as a foundation to establish a behavioral relationship between a given PTGTS
$S_0$ and $n$ PTGTSs $S_i$ that operate on the instances of FTs that are embedded into
the LST of $S_0$ according to the given LST decomposition.

For this purpose, we extend the structural embeddings given by the
monomorphisms $\alpha$ from FTs to the LST in Definition 1 to embeddings of the
entire graph (including the static but also the dynamic parts) of a state of
some $S_i$ called fragment topology state (FTS) into the entire graph of a state of
$S_0$ called large-scale state (LSS). Consider the left middle square in Figure 4d
where the embedding $\alpha_i$ together with the FT and LST embeddings $\kappa_i$ and $\kappa$
is complemented with an embedding $e_i$ of the FTS $F_i$ into the LSS $G$. Note
that $e_i$ must be an extension of $\alpha_i$ in the sense that the square commutes (i.e.,
$\kappa \circ \alpha_i = e_i \circ \kappa_i$ is required). Also, $e_i \circ \kappa_i$ must satisfy the constraint $\phi_i$ of the FT
used for $S_i$.

To simplify our presentation, we assume that the PTGTS $S_0$ (as in our
running example) only employs APs of the form $\exists(f : \emptyset \hookrightarrow P, \top)$, invariants
of the form $\neg\exists(f : \emptyset \hookrightarrow P, \top)$, and application conditions in PTGT rules that
are conjunctions of graph conditions of the form $\neg\exists(f : L \hookrightarrow P, \top)$ for some
graph $P$. This restriction simplifies the identification of parts of FTSs and LSSs
that are considered for an evaluation of such graph conditions.

As a next step, we present a decomposition relation, which establishes a
relationship between $S_0$ and the PTGTSs $S_i$ in terms of embedding monomorphisms
$\kappa$, $\alpha_i$, $e_i$, and $\kappa_i$ for all reachable states of $S_0$. Moreover, the decomposition
relation requires that (a) the timed and discrete steps of $S_0$ can be
mimicked by each affected $S_i$ and (b) discrete steps performed by some
PTGTS $S_i$ in isolation on a part of the LST where the FT $\overline{F}_i$ does not overlap
with the FT $\overline{F}_j$ of another PTGTS $S_j$ with $i \ne j$ can be mimicked by $S_0$. That is,
the decomposition relation is a simulation for the steps performed by $S_0$ and a
bisimulation on those steps that are performed in isolation by a single PTGTS
$S_i$. Also, to allow deriving results for $S_0$ from a model checking based analysis
of the PTGTSs $S_i$, we require a set of APs $A$ that is part of the APs of $S_0$ and
of each $S_i$. Based on this set $A$, the decomposition relation also requires that
only those FTSs and LSSs are related that satisfy the same sets of APs in $A$. For
our running example, this set will contain all three APs of $S_0$ (see Figure 2e).
Finally, we require that the initial states of $S_0$ and the $n$ PTGTSs $S_i$ are covered
by the decomposition relation.


Definition 2 (Decomposition Relation). Given

- (PTGTS FOR LARGE-SCALE SYSTEM) $S_0$ is a PTGTS with initial LSS $s_0 = (G_0, v_0)$
  where the LST is identified via $\kappa_0 : \overline{G}_0 \hookrightarrow G_0$ (and preserved by all
  steps of the PTGTS),
- (PTGTSs FOR FTs) for each $1 \le i \le n$: $S_i$ is a PTGTS with initial FTS $s_{0,i} = (F_{0,i}, v_{0,i})$
  where the underlying FT is identified via $\kappa_i : \overline{F}_{0,i} \hookrightarrow F_{0,i}$ (and preserved
  by all steps of the PTGTS),
- (PRESERVED ATOMIC PROPOSITIONS) $A$ is a set of APs contained in each $S_i$, and
- (DECOMPOSITION OF THE LST) $M$ is a decomposition of size $n$ of the LST of $S_0$
  w.r.t. $\kappa_0$, $\mathcal{F} = \{\overline{F}_{0,i} \mid 1 \le i \le n\}$, and some overlapping specification $o$ (cf.
  Definition 1).

$S$ is a decomposition relation between $S_0$ and $(S_1, \dots, S_n)$ containing tuples of
the form $((G, v), \kappa : \overline{G} \hookrightarrow G, w)$ where $(G, v)$ is a state of $S_0$, $\kappa$ identifies the LST
of $G$, and $w$ is a tuple of length $n$ of tuples of the form $(s_i, \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ when the
following items are satisfied.

1. (ELEMENTS OF DECOMPOSITION RELATION) The relation $S$ explains how the FTS
   of the PTGTS $S_i$ is embedded into the LSS of $S_0$, i.e., (see Figure 4d) if
   $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$ and $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ is the $i$th element of $w$,
   then $s_i = (F_i, v_i)$ is a state of $S_i$, $(\overline{F}_i, \phi_i, \alpha_i)$ is the $i$th element of $M$,
   $\alpha_i : \overline{F}_i \hookrightarrow \overline{G}$ (embedding of FT into LST), $e_i : F_i \hookrightarrow G$ (embedding of FTS into LSS),
   $e_i \circ \kappa_i$ satisfies $\phi_i$, and $\kappa \circ \alpha_i = e_i \circ \kappa_i$ (embedding $e_i$ is an extension of
   embedding $\alpha_i$).

2. (CONSISTENT VALUATIONS) The clock valuations of each FTS agree with the LSS,
   i.e., if $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$, $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i) \in w$, and $e_i(c_i) = c$,
   then $v_i(c_i) = v(c)$.

3. (INITIAL STATES RELATED) The initial LSS of $S_0$ is related, i.e., $(s_0, \kappa_0, w) \in S$
   for some $w$ where the $i$th element $(s_i, \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ of $w$ satisfies $s_i = s_{0,i}$.

4. (ATOMIC PROPOSITIONS) The labeling with APs is in agreement w.r.t. $A$, i.e., if
   $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$, $ap = \exists(f : \emptyset \hookrightarrow P, \top) \in A$, and the monomorphism
   $k : P \hookrightarrow G$ shows that $ap$ is satisfied by $G$, then there is some $1 \le i \le n$ such
   that $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ is the $i$th element of $w$, and there is some $k_i : P \hookrightarrow F_i$
   showing that $ap$ is satisfied by $F_i$ and $k = e_i \circ k_i$.

5. (BISIMULATION OF TIMED STEPS) If $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$ and $S_0$ has a
   timed step (not involving a PTGT rule) from $(G, v)$ to $(G, v + \delta)$, then there is
   some $((G, v + \delta), \kappa, w') \in S$ where $w'$ is obtained pointwise from $w$ by applying
   the corresponding timed step to each $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i) \in w$ resulting in
   $((F_i, v_i + \delta), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ and vice versa for a common timed step of each $S_i$ of
   duration $\delta$.

6. (SIMULATION OF STRUCTURAL STEPS OF $S_0$ BY $S_i$) If
   - $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$ and
   - $S_0$ performs the structural step from $(G, v)$ to $(G'', v'')$ using an underlying
     GT rule $\rho = (\ell : K \hookrightarrow L, r : K \hookrightarrow R, \phi_{ac})$ given in Figure 4d where, since the
     step of $S_0$ preserves the LST, there are unique $\kappa' : \overline{G} \hookrightarrow G'$ and $\kappa'' : \overline{G} \hookrightarrow G''$
     such that $\hat{\ell} \circ \kappa' = \kappa$ and $\kappa'' = \hat{r} \circ \kappa'$, then
   - $((G'', v''), \kappa'' : \overline{G} \hookrightarrow G'', w'') \in S$ for some $w''$ that is obtained pointwise
     from $w$ by adapting each tuple $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i) \in w$ into a resulting
     tuple $((F_i'', v_i''), \overline{F}_i, \phi_i, \alpha_i, \kappa_i'', e_i'')$ as follows. If $m(L) \cap e_i(F_i) = \emptyset$, then all
     components of the tuple remain unchanged. Otherwise, the PTGTS $S_i$ must
     simulate the step and the tuple needs the updating described in the following
     steps.
     - There must be a step of $S_i$ as given in Figure 4d from $F_i$ to $F_i''$ for some
       underlying rule $\rho_i = (\ell_i : K_i \hookrightarrow L_i, r_i : K_i \hookrightarrow R_i, \phi_{ac,i})$ with the same
       probability and priority as $\rho$.
     - Since the step of $S_i$ preserves the FT, there are unique $\kappa_i' : \overline{F}_i \hookrightarrow F_i'$ and
       the required $\kappa_i'' : \overline{F}_i \hookrightarrow F_i''$ such that $\hat{\ell}_i \circ \kappa_i' = \kappa_i$ and $\kappa_i'' = \hat{r}_i \circ \kappa_i'$.
     - The step of $S_i$ must allow for $e_i' : F_i' \hookrightarrow G'$ and $e_i'' : F_i'' \hookrightarrow G''$ such that
       $\hat{\ell} \circ e_i' = e_i \circ \hat{\ell}_i$ and $\hat{r} \circ e_i' = e_i'' \circ \hat{r}_i$.

7. (SIMULATION OF STRUCTURAL STEPS OF $S_i$ ON ITS CORE BY $S_0$) If
   - $((G, v), \kappa : \overline{G} \hookrightarrow G, w) \in S$,
   - $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i) \in w$,
   - $S_i$ performs the structural step from $(F_i, v_i)$ to $(F_i'', v_i'')$ using an underlying
     GT rule $\rho_i = (\ell_i : K_i \hookrightarrow L_i, r_i : K_i \hookrightarrow R_i, \phi_{ac,i})$ given in Figure 4d where,
     since the step of $S_i$ preserves the FT, there are unique $\kappa_i' : \overline{F}_i \hookrightarrow F_i'$ and
     $\kappa_i'' : \overline{F}_i \hookrightarrow F_i''$ such that $\hat{\ell}_i \circ \kappa_i' = \kappa_i$ and $\kappa_i'' = \hat{r}_i \circ \kappa_i'$,
   - $e_i(m_i(L_i))$ does not overlap with any $e_j(F_j)$ for $i \ne j$, then
   - there is some $((G'', v''), \kappa'' : \overline{G} \hookrightarrow G'', w'') \in S$ for some $G''$, $v''$, $\kappa''$, and $w''$
     as follows.
     - There must be a step of $S_0$ as given in Figure 4d from $G$ to $G''$ for some
       underlying rule $\rho = (\ell : K \hookrightarrow L, r : K \hookrightarrow R, \phi_{ac})$ with the same probability
       and priority as $\rho_i$.
     - Since the step of $S_0$ preserves the LST, there are unique $\kappa' : \overline{G} \hookrightarrow G'$ and
       the required $\kappa'' : \overline{G} \hookrightarrow G''$ such that $\hat{\ell} \circ \kappa' = \kappa$ and $\kappa'' = \hat{r} \circ \kappa'$.
     - The step of $S_0$ must allow for $e_i' : F_i' \hookrightarrow G'$ and $e_i'' : F_i'' \hookrightarrow G''$ such that
       $\hat{\ell} \circ e_i' = e_i \circ \hat{\ell}_i$ and $\hat{r} \circ e_i' = e_i'' \circ \hat{r}_i$.
     - Finally, $w''$ is obtained from $w$ by only adapting the above chosen tuple
       $((F_i, v_i), \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$ into the tuple $((F_i'', v_i''), \overline{F}_i, \phi_i, \alpha_i, \kappa_i'', e_i'')$.
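
The interplay of consistent valuations and timed steps in Definition 2 can be illustrated with a minimal Python sketch. It assumes that valuations are dictionaries from clock names to values and that each embedding is reduced to its clock component, i.e., a dictionary from clocks of an FTS to clocks of the LSS; the names `consistent` and `timed_step` as well as all clock names are hypothetical.

```python
Valuation = dict[str, float]


def consistent(v: Valuation, fragments: list[tuple[Valuation, dict[str, str]]]) -> bool:
    """Check v_i(c_i) = v(c) for every FTS clock c_i that is embedded onto c."""
    return all(
        v_i[c_i] == v[e_i[c_i]] for v_i, e_i in fragments for c_i in v_i
    )


def timed_step(valuation: Valuation, delta: float) -> Valuation:
    """Advance all clocks of a valuation by the same duration delta."""
    return {clock: value + delta for clock, value in valuation.items()}


# Hypothetical LSS with three clocks and two FTSs that share the clock of t2.
v = {"t1.clk": 0.5, "t2.clk": 1.0, "t3.clk": 0.0}
fragments = [
    ({"a.clk": 0.5, "b.clk": 1.0}, {"a.clk": "t1.clk", "b.clk": "t2.clk"}),
    ({"x.clk": 1.0, "y.clk": 0.0}, {"x.clk": "t2.clk", "y.clk": "t3.clk"}),
]

before = consistent(v, fragments)
# A common timed step of the LSS and of every FTS preserves consistency.
after = consistent(
    timed_step(v, 0.25), [(timed_step(v_i, 0.25), e_i) for v_i, e_i in fragments]
)
# Advancing only the LSS breaks the agreement of the valuations.
only_lss = consistent(timed_step(v, 0.25), fragments)
```

This reflects why timed steps are required to be taken jointly by $S_0$ and all $S_i$, whereas structural steps only affect the fragments whose embedding overlaps with the match.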


We now state that decomposition relations allow for the simulation of each
path of the PTGTS $S_0$ by the PTGTSs $S_i$.


Lemma 1 (Existence of Simulating Paths). If $S$ is a decomposition relation between
$S_0$ and $(S_1, \dots, S_n)$, and $\pi$ is a path of length $m$ in $S_0$ from the initial state to
a state $s_m$, then, for each $1 \le i \le n$, there is a path $\pi_i$ of $S_i$ (of length $k_i \le m$) ending
in a state $s_{i,k_i}$ such that $(s_m, \kappa, w) \in S$ for some $\kappa$ and $w$ where the $i$th element of $w$ is
of the form $(s_{i,k_i}, \overline{F}_i, \phi_i, \alpha_i, \kappa_i, e_i)$. Moreover, the probability of each such path $\pi_i$ is at
least as high as the probability of the path $\pi$. See [20] for the proof.


We now state that a PTGTS satisfies a safety property given by an AP when
safety w.r.t. this AP can be established for each $S_i$.


Theorem 1 (Safety Verification). If $S$ is a decomposition relation between $S_0$ and
$(S_1, \dots, S_n)$ w.r.t. $A$ and $ap \in A$, then $S_0$ is safe w.r.t. the occurrence of an $ap$-labeled
graph when (for each $1 \le i \le n$) $S_i$ is safe w.r.t. the occurrence of an $ap$-labeled graph.
Moreover, the probability of an occurrence of an $ap$-labeled graph from some state $s$ in
$S_0$ is smaller than the probability of an occurrence of an $ap$-labeled graph from some
$S$-related state $s_i$ in $S_i$. See [20] for the proof.


We now apply the proposed methodology of establishing a behavioral relationship
between the PTGTS $S_0$ and the PTGTSs $S_i$ to our running example.
For this purpose, we now describe how the FTS of each $S_i$ is embedded into
the LSS of $S_0$ and, based on this embedding, how each $S_i$ is derived from $S_0$.
Example 2 (Construction of Embeddings and Simulating PTGTSs). Firstly, the embeddings
$e_i$ of FTSs into the LSS are obtained as extensions of the structural
embeddings $\alpha_i$ by also matching (a) all Shuttle nodes (with their attributes) that
are connected to Track nodes contained in the FT via next edges and (b) all active
attributes of TLYellow and TLGreen nodes contained in the FT. This extension
also naturally applies to the initial state of $S_0$. Clearly, two embeddings $e_i$ and
$e_j$ (for $i \ne j$) only overlap in elements of their FTs but not in the additionally
matched dynamic elements.

Secondly, we adapt the given PTGTS $S_0$ to obtain for each of the eight FTs
one PTGTS $S_i$ by (a) changing the initial graph to the source of $e_i$ capturing
the FT as well as the additional dynamic elements of the initial state of $S_0$
connected to it and (b) adding eight rules for overapproximating the behavior
of $S_0$ on the tracks that may overlap with tracks of other FTs. For the latter
point, we observe that all but three of the rules of $S_0$ (including SetSlow and
ConstructionSiteBrake from Figure 2) are never applicable on the parts of FTs
that may overlap with other FTs (i.e., borders of FTs). The remaining three rules
are Drive from Figure 3a as well as two similar rules for stopping the shuttle
that we do not consider in detail here. Three of the four derived rules for rule
Drive are given in Figure 3.

The additional rule DriveEnterFast is used to simulate Drive steps where a
shuttle in $S_0$ drives from a track not covered by $S_i$ to a track covered by $S_i$.
The rule DriveEnterFast is essentially constructed by omitting the source track
$T_1$ from the rule Drive, by adding the shuttle with one of the two expected
velocities (the other velocity results in the omitted rule DriveEnterSlow)³, and
by omitting application conditions that may not be satisfied due to the overlapping
specification and the structure of FTs.

Similarly, the additional rules DriveExit1 and DriveExit2 are constructed
from rule Drive to allow for the simulation of the two steps in which a shuttle
in $S_0$ drives using rule Drive on two tracks covered by $S_i$ to a track not
covered by $S_i$. These two rules are then constructed similarly, by omitting the
tracks $T_3$ (for DriveExit1) and $T_2$ and $T_4$ (for DriveExit2) from rule Drive as
these are not covered by the $S_i$, by removing the shuttle with its attributes in
rule DriveExit2, by omitting application conditions that may not be satisfied
due to the overlapping specification and the structure of FTs, and by omitting
application conditions that refer to the removed tracks.

Note that these additional rules overapproximate the behavior that is possible
in $S_0$ as they may be used when analyzing $S_i$ also when no corresponding
shuttle in $S_0$ is able to enter the FT or when rule Drive would be disabled
due to the omitted application conditions for the case of rules DriveExit1 and
DriveExit2.


For our running example, we now describe the construction of a suitable de- 
composition relation relying on the LST decomposition introduced before. 


³ Here, we rely on the constraints on the eight FTs (cf. Example 1) requiring that the
AP $AP_{unexpectedVelocity}$ is never labeled in the large-scale system $S_0$.
Lemma 2 (Existence of Decomposition Relation for Running Example). For
the PTGTS $S_0$ of our running example with an arbitrary initial LST such that $M$ is a
decomposition of that LST w.r.t. some monomorphism $\kappa$, the set of eight FTs, and the
overlapping specification $o$ from Example 1, there is a decomposition relation $S$ between
$S_0$ and the $n$ PTGTSs $S_i$ from Example 2. See [20] for the proof.


Based on this decomposition relation and Theorem 1, we can obtain the desired
overapproximation result for $S_0$ for the qualitative safety w.r.t. collisions and
the quantitative unlikeliness of emergency brakes.


Corollary 1 (Qualitative and Quantitative Safety for Running Example). $S_0$
exhibits no collisions when this is the case for each $S_i$. Moreover, emergency brakes
are performed in $S_0$ with a probability not higher than the probability of such an
occurrence in any $S_i$.


Note that we only need to analyze one PTGTS for each of the eight permitted 
FTs w.r.t. the occurrence of collisions and the probability of emergency brakes. 


6 Evaluation 


To analyze the eight PTGTSs constructed for our running example in Section 5
(see Table 1 for the results), we employed the methodology from [19]: we
generated the state spaces for these PTGTSs without timed steps and then
generated the corresponding PTA from these state spaces. We then restricted
these PTA to timed automata (TA), essentially removing the information on
probabilities, applied UPPAAL [15] to determine the edges of the TA that can
never be applied due to unsatisfiable guards, and removed the corresponding
edges from the previously generated PTA. The entire analysis using our
prototypical implementation required less than three days on a machine using 
up to 250GB memory where the state space generation required most of the 
time. However, there is a vast potential for optimizations regarding memory 
consumption (by only storing subsequently relevant information on states and 
steps) and runtime (by facilitating concurrency during state space generation). 

Firstly, using UPPAAL, we have verified that each of the eight TA (hence,
also each of the eight PTA) has no reachable deadlock (where also timed steps
are disabled). Hence, the PTGTS $S_0$ also does not contain this particular
modeling error since, using the decomposition relation, we also obtain
that every deadlock reachable in $S_0$ can be reached analogously in each $S_i$.

Secondly, we have observed that the obtained PTA do not label any location
with $AP_{unexpectedVelocity}$ or $AP_{collision}$. For $AP_{unexpectedVelocity}$ this means that
the additional rules such as DriveEnterFast and DriveEnterSlow for overapproximating
the steps of entering shuttles entirely cover all possible velocities of
shuttles. For $AP_{collision}$ this means that Corollary 1 implies that the PTGTS $S_0$
with an LST constructed in the described way from the eight FTs is safe w.r.t.
the occurrence of collisions.

Thirdly, to verify that yellow traffic lights suitably slow down the shuttles
before construction sites, we have identified locations $\ell_i$ in the resulting PTA




Table 1: Results of our evaluation for the running example

fragment    states    steps     collisions   max. probability for violating the
topology                                     velocity limit at a construction site
FT1         9         18        0            0
FT2         335       693       0            0
FT3         216       503       0            0
FT4         109379    312915    0            $1 \times 10^{-6}$
FT5         106122    284102    0            $1 \times 10^{-12}$
FT6         12473     31812     0            0
FT7         4048      16314     0            0
FT8         121953    452340    0            0


that are labeled with APbrakea (occurring only in FT4 and FT5). In each case, we 
were able to track the shuttle backwards, using a custom analysis algorithm
(since the PRISM model checker was too slow for the large PTA at hand), over
all possible paths leading to such a location $\ell_i$ up to the step where the shuttle
entered the FT. We then determined the maximal probability of any such path,
obtaining a worst-case emergency brake probability of $10^{-6}$ and $10^{-12}$ for any
entering shuttle in FT4 and FT5, respectively. On the one hand, FT5 is thereby
verified to be quantitatively more desirable than FT4. On the other
hand, Corollary 1 implies that installations of yellow traffic lights as in FT4
and FT5 suitably decrease the likelihood of emergency brakes also for $S_0$.
However, the probabilities that some shuttle executes an emergency brake in a
given time span in FT4/FT5 (obtained by combining the maximal throughput
of shuttles for FT4/FT5 with the worst-case probability obtained for FT4/FT5)
are likely to be overly coarse upper bounds when the real system does not
reach the maximal throughput.


7 Conclusion and Future Work 


We presented an analysis approach for large-scale systems modeled as PT- 
GTSs for which model checking is not feasible. In this approach, we rely on 
a decomposition of an underlying static large-scale topology into fragment 
topologies of manageable size. Model checking is then applied for each frag- 
ment topology and an adaptation of the PTGTS to such a fragment topology. 
We thereby determine (a) overapproximations of reachability properties im- 
portant for qualitative safety properties and (b) upper bounds for probabilistic 
reachability properties important for quantitative safety properties. 

As future work, we intend to extend our analysis to fairness properties 
and conditions of the metric temporal graph logic (MTGL) [29]. Also, to cover 
further aspects of the RailCab project [23], we will develop more general de- 
composition schemes where dynamic components (such as connected shuttles 
driving in convoys) may be covered by multiple fragment topologies. Lastly, to
further evaluate the applicability of our approach, we intend to apply it to other
case studies such as the one discussed in [1].
Abstract. Software model checkers are able to exhaustively explore different
bounded program executions arising from various sources of non-determinism.
These tools provide statements to produce non-deterministic values
for certain variables, thus forcing the corresponding model
checker to consider all possible values for these during verification. While 
these statements offer an effective way of verifying programs handling ba- 
sic data types and simple structured types, they are inappropriate as a 
mechanism for nondeterministic generation of pointers, favoring the use 
of insertion routines to produce dynamic data structures when verifying, 
via model checking, programs handling such data types. 

We present a technique to improve model checking of programs han- 
dling heap-allocated data types, by taming the explosion of candidate 
structures that can be built when non-deterministically initializing heap 
object fields. The technique exploits precomputed relational bounds, that 
disregard values deemed invalid by the structure’s type invariant, thus 
reducing the state space to be explored by the model checker. Precomputing
the relational bounds is itself a challenging and costly task, for which
we also present an efficient algorithm, based on incremental SAT solving.
We implement our approach on top of the CBMC bounded model checker, 
and show that, for a number of data structure implementations, we can
handle significantly larger input structures and detect faults that CBMC 
is unable to detect. 


1 Introduction 


SAT-based bounded model checking [7] is an automated software analysis tech- 
nique, consisting of appropriately encoding a program as a propositional formula 
in such a way that its satisfying valuations correspond to program defects, such 
as violations of assertions, uncaught exceptions and memory leaks. Satisfying
valuations of the obtained propositional formulas can be automatically searched 
for by resorting to SAT solving, exploiting the constant advances in this analysis 
technology. SAT-based bounded model checking achieves full automation in pro- 
gram verification at the cost of completeness: it limits the number of times that 
loops are allowed to be executed to a user provided loop unwinding bound. This 
in turn limits the data that the program can manipulate, which is constrained 
to the program parameters, and what the program can allocate in its bounded 
executions. Hence, although the approach is capable of exploring a huge num- 
ber of execution traces, it cannot prove program correctness due to its bounded 
nature. Nevertheless, it is very useful for bug finding, and is able to support 
fully-fledged higher-level programming languages [8]. 


A tool based on bounded model checking over SAT is CBMC [20]. It supports 
all of ANSI-C, including programs handling pointers and pointer arithmetic. The 
tool is able to exhaustively explore many user-bounded program executions re- 
sulting from various sources of non-determinism, including scheduling decisions 
and the assignment of values to program variables. To achieve this, CBMC pro- 
vides statements to produce non-deterministic values for certain variables, forc- 
ing the model checker to consider all possible values for these variables during 
verification. These statements enable program verification on all legal inputs, by 
assigning these inputs values within their corresponding (legal) domains. While 
this mechanism is effective for the verification of programs manipulating basic 
data types and simple structured types, it is disabled as a feature for the gener- 
ation of pointers. This issue forces the user to provide an ad-hoc environment to 
verify programs handling dynamic data structures. In fact, a typical, convenient 
mechanism to verify programs handling heap-allocated linked structures using 
CBMC and similar tools, is to non-deterministically build such structures using 
insertion routines [19, 22, 11]. 


The aforementioned approach, while effective, has its scalability tied to how 
complex the insertion routines are, and how many of these are actually needed. 
Indeed, there are many linked structures whose domain of valid structures can- 
not be built only via insertion operations (e.g., red-black trees and node caching 
linked lists require insertions as well as removals, in order to reach all bounded 
valid structures). In this paper, we study an alternative technique for verifying, 
using CBMC, programs handling heap-allocated linked structures. The approach 
essentially consists of building a pool of objects with nondeterministically ini- 
tialized fields, which are then used for nondeterministically building structures. 
The rapid explosion in the number of generated linked structures is tamed by ex- 
ploiting precomputed bounds for fields, that disregard values deemed invalid by 
the structure’s assumed properties, such as datatype invariants and routine pre- 
conditions. This leaves us the additional problem of precomputing these bounds, 
a computationally costly task on its own. We then present a novel algorithm for 
these precomputations, based on incremental SAT solving, making the whole 
process fully automated. 
avl_init(t);
int size = nondet_int();
__CPROVER_assume(size >= 0 && size <= MAX_SIZE);
for (int i = 0; i < size; i++) {
    int value = nondet_int();
    __CPROVER_assume(value >= MIN_VAL && value < MAX_VAL);
    avl_insert(t, value);
}
int r_value = nondet_int();
__CPROVER_assume(r_value >= MIN_VAL && r_value < MAX_VAL);
avl_remove(t, r_value);
__CPROVER_assert(avl_repok(t), "avl_remove preserves the AVL invariant");


Fig. 1: Verification of AVL remove, building structures by multiple insertions. 


We perform an experimental evaluation on a benchmark of data structure 
implementations, showing that the use of field bounds significantly improves
both memory consumption and verification running times (including
the precomputations), allowing CBMC to consider larger structures as well as to 
detect faults that could not be detected without their use. 


2 A Motivating Example 


Let us start by describing a particular verification scenario, that will serve the 
purpose of motivating our approach. Suppose that we have an implementation 
of dictionaries, based on AVL trees; furthermore, we would like to verify that 
the remove operation on this structure preserves the structure’s invariant, i.e., 
after a removal is performed, the resulting structure is still a valid AVL tree 
(acyclic, with every node having at most one parent, sorted, and balanced). 
Moreover, let us assume that, besides operation avl_remove, we have AVL’s 
avl_init, avl_insert and avl_repok, the latter being a routine that checks 
whether a given structure satisfies the AVL invariant, as described above. In 
order to perform the desired verification, we can proceed by building the program 
shown in Figure 1. Notice how this program: 


— employs CBMC primitives to nondeterministically decide how many values, 
and which values to insert in/remove from the tree (appropriately con- 
strained by constants MAX_SIZE, MIN_VAL and MAX_VAL), 

— uses an AVL insertion routine to produce the insertions, and 

— uses an avl_repok routine, which checks the AVL invariant on the linked 
structure rooted at t. 


When running CBMC on this program, if loops are unwound enough and 
no violation of the assertion is obtained, then we have verified that, within the 
provided bounds, remove indeed preserves the invariant. 
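
The invariant checked by avl_repok is only described in prose above. The following is a minimal sketch of such a routine, not the implementation used in the paper. It assumes a node type with fields key, height, left and right (the name key is hypothetical), that the routine receives the root pointer, that a leaf has height 1 (as in the generation routines shown later), and that at most MAX_NODES nodes exist.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8                 /* assumed bound on the number of nodes */

typedef struct avlnode {
    int key;                        /* hypothetical field name */
    int height;
    struct avlnode *left, *right;
} avlnode;

/* Returns the height of the subtree rooted at n, or -1 if the subtree
   violates the invariant. seen[] records the nodes visited so far. */
static int check(const avlnode *n, long long lo, long long hi,
                 const avlnode **seen, int *nseen) {
    assert(seen != NULL && nseen != NULL);
    if (n == NULL) return 0;
    for (int i = 0; i < *nseen; i++)
        if (seen[i] == n) return -1;            /* cycle or second parent */
    if (*nseen == MAX_NODES) return -1;         /* more nodes than allowed */
    seen[(*nseen)++] = n;
    if (n->key <= lo || n->key >= hi) return -1;    /* not sorted */
    int hl = check(n->left, lo, n->key, seen, nseen);
    int hr = check(n->right, n->key, hi, seen, nseen);
    if (hl < 0 || hr < 0) return -1;
    if (hl - hr > 1 || hr - hl > 1) return -1;      /* not balanced */
    int h = 1 + (hl > hr ? hl : hr);
    return n->height == h ? h : -1;             /* stored height must match */
}

bool avl_repok(const avlnode *t) {
    const avlnode *seen[MAX_NODES];
    int nseen = 0;
    return check(t, LLONG_MIN, LLONG_MAX, seen, &nseen) >= 0;
}
```

Visiting a node twice is rejected in one place, which covers both cycles and nodes with more than one parent. Because the traversal is bounded by MAX_NODES, the routine terminates on any input, which matters when it is used inside an assumption over arbitrary heaps.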

The above traditional approach to verifying linked structures using CBMC 
and similar tools [19, 22,11] has its efficiency tied to how complex the involved 
routines are, in particular the insertion routine(s) (the avl_remove routine, being 
verified, cannot be avoided). 
/* Left: verification harness */
t = nondet_avl(MAX_SIZE, MIN_VAL, MAX_VAL);
__CPROVER_assume(avl_repok(t));
int r_value = nondet_int();
__CPROVER_assume(r_value >= MIN_VAL && r_value < MAX_VAL);
avl_remove(t, r_value);
__CPROVER_assert(avl_repok(t), "avl_remove preserves the AVL invariant");

/* Right: fragment of nondet_avl */
avlnode* nondet_avl(int size, int min_val, int max_val) {
    avlnode *n = malloc(sizeof(avlnode) * size);
    avlnode *result = NULL;
    if (nondet_bool())
        // root is null
        return result;
    result = n[0]; // root is n0
    n[0]->left = NULL;
    if (nondet_bool())
        n[0]->left = n[1];
    n[0]->right = NULL;
    if (nondet_bool())
        n[0]->right = n[1];
    else if (nondet_bool())
        n[0]->right = n[2];
    // ... (fragment: remaining assignments not shown)
    return result;
}


Fig. 2: Verification of AVL remove, nondeterministically building linked structs 


An alternative approach, employed by some symbolic execution-based model 
checkers, notably [3,23], consists of creating a pool of nodes, whose fields are 
nondeterministically set, and which are also nondeterministically used to build 
data structures. The process is illustrated in Figure 2. The key is in the use of 
a routine nondet_avl(), which encapsulates the generation of the linked struc- 
ture. A fragment of this routine is shown at the right of Figure 2. Notice how 
this routine will generate invalid structures, e.g., cyclic ones. The avl_repok(t) 
assumption after the generation will take care of disregarding these invalid
structures for verification. Notice how our manually written example generation
routine avoids using any node besides n[0] as the root, or any node but n[1] as
n[0]->left, thus avoiding some isomorphic structures and obvious cycles, but
it does not prevent nodes from having more than one parent, nor does it seem to take
into account the tree's balancedness. Of course, we have other alternatives when
writing the nondeterministic generation routine nondet_avl. We may produce a 
generation routine that, based solely on the fields of the nodes involved in the 
structure and their types, produces all possible structures, leaving the work of
filtering out invalid ones to the assume(avl_repok(t)) part of the program. We
can also write a sophisticated generation routine specifically tailored for AVL 
trees, that already takes into account (most) invalid values for each node field, 
and thus mostly produces valid structures. The first option has as an advan- 
tage that it is generic, and thus can be made part of an automated verification 
technique, at the cost of being, intuitively, less efficient; the second (and our 
example), on the other hand, has in principle to be manually produced, and is 
more error prone, since we may be disregarding some valid values making the 
verification bounded incomplete, but is intuitively more efficient. 

The technique we present in this paper consists of automatically producing 
the second kind of generation routines. We will start with the first kind of gen- 
eration, and automatically decide which values for each field of each node can 
be safely removed, when we can establish that they do not participate in valid 
structures (i.e., structures satisfying the corresponding structure invariant). This
additional problem of deciding when a value for a node field’s domain can be 
safely removed is solved using a novel algorithm, presented in this paper, which 
uses incremental SAT solving. 


3 Tight Field Bounds 


Tight field bounds are based on a relational semantics of structures’ fields in 
program states. The relational semantics of structures is based on interpreting 
a field f at a given program state as the set of pairs (id, v) relating the identifier 
id (representing a unique reference to some data object o in the heap) with the 
value v in the field f of o at that state (i.e., o->f = v in the state). Then, 
each program state corresponds to a set of (functional) binary relations, one per 
field of the structures involved in the program. For example, the program state
containing the binary tree depicted at the left of Fig. 3 is represented by the
following relations:


$\mathit{left} = \{(N_0, N_1), (N_1, N_3)\}, \quad \mathit{right} = \{(N_0, N_2), (N_1, N_4), (N_2, N_5)\}$ (1)


For analysis techniques that must consider all possible state configurations 
that satisfy some given property, we may reduce this relational semantics by 
considering tight field bounds. Intuitively, for a field f and a property a, its 
tight field bound on a is the union of f’s representation across all program 
states that satisfy a. Tight field bounds have been used to reduce the number 
of variables and clauses in propositional representations of relational heap en- 
codings for Java automated analyses [14, 13, 2], and in symbolic execution based 
model checking to prune parts of the symbolic execution search tree constraining 
nondeterministic options [15,26] (see Section 6 for a more detailed description
of these previous applications). Tight field bounds are computed from assumed 
properties, and can be employed to restrict structures in states that are assumed 
to satisfy such properties, i.e. precondition states. In our case, we will use the 
invariant of the structure, as opposed to stronger preconditions, so that these 
can be reused across several routines of the same structure. 
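
To make the union in this intuition concrete, the sketch below stores each field relation as a boolean matrix and accumulates the pairs of two states. The first state is the tree of Equation (1). The second, a left spine N0 -> N1 -> N2, is a hypothetical example that does not come from the paper, and the names state_t, relation_t and add_state are assumptions. The result is only a subset of the tight field bound, which would require taking the union over every state satisfying the property.

```c
#include <assert.h>
#include <stdbool.h>

#define SCOPE 6          /* node identifiers N0..N5 */
#define NIL   SCOPE      /* column index standing for NULL */

/* A program state: for each node Ni, the index of its child or NIL. */
typedef struct { int left[SCOPE], right[SCOPE]; } state_t;

/* A field relation: rel[i][j] holds iff the pair (Ni, Nj) belongs to it;
   column NIL records that the field of Ni may be NULL. */
typedef bool relation_t[SCOPE][SCOPE + 1];

/* Adds the relational representation of state s to both relations. */
static void add_state(const state_t *s, relation_t left, relation_t right) {
    for (int i = 0; i < SCOPE; i++) {
        assert(0 <= s->left[i] && s->left[i] <= NIL);
        assert(0 <= s->right[i] && s->right[i] <= NIL);
        left[i][s->left[i]] = true;
        right[i][s->right[i]] = true;
    }
}

/* The tree of Equation (1). */
static const state_t fig3_left = {
    .left  = { 1, 3, NIL, NIL, NIL, NIL },
    .right = { 2, 4, 5, NIL, NIL, NIL },
};

/* Hypothetical second valid tree: the left spine N0 -> N1 -> N2. */
static const state_t spine = {
    .left  = { 1, 2, NIL, NIL, NIL, NIL },
    .right = { NIL, NIL, NIL, NIL, NIL, NIL },
};
```

After adding both states, the left relation contains (N0, N1), (N1, N3) and (N1, N2), while a pair such as (N0, N0) stays absent because neither state contains it.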


Definition 1. Let f be a field of structure $T_1$ with type $T_2$. Let i and j be the
scopes for types $T_1$ and $T_2$, respectively. Let $A = \{a_1, \ldots, a_i\}$ be the identifiers for
data objects of type $T_1$, and let $B = \{b_1, \ldots, b_j\}$ be the identifiers for data objects
of type $T_2$. Given an identifier k, $o_k$ denotes the corresponding data object. The
tight field bound for field f is the smallest relation $U_f \subseteq A \times (B + \mathit{Null})$ satisfying:
$(x, y) \in U_f$ iff there exists a valid heap instance I in which $o_x$->f = $o_y$.


By scope we mean the limit in the number of objects, ranges for numerical 
types, and maximum depth in loop unwinding, as in [17,12]. An important 
assumption we make for analysis is that structure invariants do not refer to 
the specific heap addresses of data objects, and in particular that these do not 
Fig. 3: Two valid binary trees.


use pointer arithmetic. Therefore, permuting data object identifiers on a valid 
instance still yields a valid instance (i.e., permuting the actual locations of data 
objects in the heap is irrelevant for invariant satisfaction). This is usually
the case, and is indeed the case in all the examples that we will present in
Section 5. This is an important assumption because it enables us to add an
additional implicit invariant: symmetry breaking. This has an important impact
on the size of tight field bounds, since they get greatly reduced when isomorphic
structures are removed. We use a symmetry breaking procedure that removes 
all symmetries. For details, we refer the reader to [14, 13]. 


4 A Technique for Nondeterministic Generation of 
Dynamic Structures 


We are now ready to describe the technique for nondeterministic generation of 
dynamic structures, used to verify programs handling such data using CBMC. 
The technique requires: 


— the program p(T x) to be analyzed; 

— a description of the structure of type T, which in the dynamic case, typically 
consists of a struct or set of structs (that are dynamically allocated); 

— a boolean program repok(T x), that (operationally) decides whether a given 
instance x is valid, i.e., satisfies the structure’s invariant, or not; and 

— a tight field bound $B_f$ for every field f in T and the scope n to use for
bounded model checking of p.


The first three are required inputs; for the last one, we present later in
the paper an algorithm that computes tight bounds from the other three.

The technique starts by building a routine nondet_T(), that produces and re- 
turns structures of type T. The routine works as follows. First, for every (pointer) 
type Tt involved (including T), we start by allocating n (the scope) data objects: 


Tt *tt_nodes = malloc(sizeof(Tt) * n); 


Then, for every structure pointer type Ts (for which we already allocated n 
data objects) and field f of type Tt in Ts, we build the following nondeterministic 
assignment: 
ts_nodes[0]->f = NULL;
if (nondet_bool()) ts_nodes[0]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[0]->f = tt_nodes[1];


ts_nodes[1]->f = NULL;
if (nondet_bool()) ts_nodes[1]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[1]->f = tt_nodes[1];


Finally, nondet_T() ends by returning either NULL or t_nodes[0] (no other 
non-null node is necessary, due to symmetry breaking). Using nondet_T(), we 
build the following verification harness for p: 


T x = nondet_T();
__CPROVER_assume(repok(x));
p(x);
__CPROVER_assert(repok(x), "p preserves the invariant");


Of course the last assertion can be replaced by any expected property of p. 
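
The effect of the if / else-if chains can be observed without a model checker by replaying choices. The sketch below is an illustration under assumptions, not the generated routine: it uses a two-node pool of a hypothetical type node, and it gives nondet_bool a body that replays the bits of choice_bits, whereas under CBMC the function would be left without a body so that its result is nondeterministic. The helper names nondet_ref, code and count_pool_states are also assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N 2                                /* scope: pool size (assumed) */

typedef struct node { struct node *left, *right; } node;

static node pool[N];
static unsigned choice_bits;               /* replayed choices, one bit per call */

/* Stand-in for the nondeterministic choice: consumes one bit per call. */
static bool nondet_bool(void) {
    bool b = (choice_bits & 1u) != 0;
    choice_bits >>= 1;
    return b;
}

/* NULL by default, otherwise one of the pool nodes (if / else-if chain). */
static node *nondet_ref(void) {
    node *r = NULL;
    if (nondet_bool()) r = &pool[0];
    else if (nondet_bool()) r = &pool[1];
    return r;
}

/* Sets every field of every pool node, then returns NULL or pool[0]. */
static node *nondet_tree(void) {
    for (int i = 0; i < N; i++) {
        pool[i].left = nondet_ref();
        pool[i].right = nondet_ref();
    }
    return nondet_bool() ? NULL : &pool[0];
}

/* Code of a reference: 0 for NULL, 1 + pool index otherwise. */
static int code(const node *p) {
    assert(p == NULL || (p >= pool && p < pool + N));
    return p == NULL ? 0 : 1 + (int)(p - pool);
}

/* Replays every choice sequence and counts the distinct pool contents. */
static int count_pool_states(void) {
    bool seen[81] = { false };             /* 3 values for each of 4 fields */
    int distinct = 0;
    for (unsigned bits = 0; bits < (1u << 9); bits++) {  /* at most 9 calls */
        choice_bits = bits;
        (void)nondet_tree();
        int id = 0;
        for (int i = 0; i < N; i++)
            id = (id * 3 + code(pool[i].left)) * 3 + code(pool[i].right);
        if (!seen[id]) { seen[id] = true; distinct++; }
    }
    return distinct;
}
```

Each field can end up with three values (NULL or either pool node), so the four fields yield 81 distinct pool contents, most of which are not trees. These are the candidates that the repok assumption has to filter out.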

We now turn our attention to the use of tight field bounds to reduce
nondeterminism in nondet_T(). For every structure Ts and field f with type Tt
declared in Ts, if the pair $(N_i, N_j)$ does not belong to the tight bound $B_f$, then we
remove from nondet_T() the line:


if (nondet_bool()) ts_nodes[i]->f = tt_nodes[j]; 


To illustrate the benefits of using tight field bounds in this setting, compare
the two (semantically equivalent) nondet_avl() methods in Figure 4 for building
AVLs with size at most 4. At the left of Figure 4, we show the code for
the approach that considers all the feasible assignments to nodes’ fields within 
the scope (many assignments not displayed due to the lack of space). With 
precomputed tight field bounds we can discard a significant number of these 
assignments, that are not allowed due to the bounds, as shown at the right of 
Figure 4. Notice that, among many others, all self-loops in nodes are discarded 
by the bounds. 
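
The reduction can be quantified by multiplying, over all fields of all pool nodes, the number of values each field may take. The sketch below does this for the two routines of Figure 4. The per-field counts are read off the figure (for the unbounded routine, n[1] to n[3] are taken to follow the same pattern as n[0], as its comment states); the helper name count_candidates and the two arrays are assumptions made for illustration. The additional NULL-root case is not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Number of candidate structures: the product, over all fields, of the
   number of distinct values the generation routine may assign. */
static unsigned long count_candidates(const int *values_per_field, int nfields) {
    assert(values_per_field != NULL && nfields >= 0);
    unsigned long total = 1;
    for (int i = 0; i < nfields; i++)
        total *= (unsigned long)values_per_field[i];
    return total;
}

/* Value counts in the order left, right, height for n[0], ..., n[3]. */

/* Without bounds: 5 references (NULL, n[0..3]) and 4 heights (0..3). */
static const int unbounded[12] = { 5, 5, 4,  5, 5, 4,  5, 5, 4,  5, 5, 4 };

/* With tight field bounds, as listed at the right of Figure 4. */
static const int bounded[12]   = { 2, 3, 3,  2, 2, 2,  2, 2, 2,  1, 1, 1 };
```

Under these counts the routine without bounds admits 100^4 = 100000000 candidates before filtering by the invariant, while the routine with bounds admits 18 * 8 * 8 * 1 = 1152.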


4.1 Computing Tight Field Bounds 


For the rest of this section we assume a fixed structure T, with fields $f_1, \ldots, f_m$
and representation invariant repok, and a fixed scope k. Tight field bounds for
T can be automatically computed from assumed properties such as invariants 
and preconditions. These properties must be expressed in a language amenable 
to automated analysis, reducible to SAT-based analysis in our case. We employ 
the automated translation of the definition of T and its repok to a propositional 
formula implemented in the TACO tool [14,13]. We also assume a symmetry 
breaking predicate is created by this translation, forcing canonical orderings 
of heap nodes in structures (see [14,13] for a careful description of how these 
symmetry-breaking predicates are automatically built). We discuss below the 
avlnode* nondet_avl() {                       avlnode* nondet_avl() {
  avlnode *n = malloc(sizeof(avlnode) * 4);   avlnode *n = malloc(sizeof(avlnode) * 4);
  if (nondet_bool())                          if (nondet_bool()) return NULL;
    return NULL;                              avlnode *result = n[0];
  avlnode *result = n[0];                     // assignments to n[0]'s fields
  // assignments to n[0]'s fields             n[0]->left = NULL;
  n[0]->left = NULL;                          if (nondet_bool())
  if (nondet_bool())                            n[0]->left = n[1];
    n[0]->left = n[0];                        n[0]->right = NULL;
  else if (nondet_bool())                     if (nondet_bool())
    n[0]->left = n[1];                          n[0]->right = n[1];
  else if (nondet_bool())                     else if (nondet_bool())
    n[0]->left = n[2];                          n[0]->right = n[2];
  else if (nondet_bool())                     n[0]->height = 1;
    n[0]->left = n[3];                        if (nondet_bool())
  n[0]->right = NULL;                           n[0]->height = 2;
  if (nondet_bool())                          else if (nondet_bool())
    n[0]->right = n[0];                         n[0]->height = 3;
  else if (nondet_bool())                     // assignments to n[1]'s fields
    n[0]->right = n[1];                       n[1]->left = NULL;
  else if (nondet_bool())                     if (nondet_bool())
    n[0]->right = n[2];                         n[1]->left = n[3];
  else if (nondet_bool())                     n[1]->right = NULL;
    n[0]->right = n[3];                       if (nondet_bool())
  n[0]->height = 0;                             n[1]->right = n[3];
  if (nondet_bool())                          n[1]->height = 1;
    n[0]->height = 1;                         if (nondet_bool())
  else if (nondet_bool())                       n[1]->height = 2;
    n[0]->height = 2;                         // assignments to n[2]'s fields
  else if (nondet_bool())                     n[2]->left = NULL;
    n[0]->height = 3;                         if (nondet_bool())
  // assignments to n[1], n[2] and n[3]'s       n[2]->left = n[3];
  // fields follow a similar pattern to       n[2]->right = NULL;
  // n[0]'s and are omitted                   if (nondet_bool())
  return result;                                n[2]->right = n[3];
}                                             n[2]->height = 1;
                                              if (nondet_bool())
                                                n[2]->height = 2;
                                              // assignments to n[3]'s fields
                                              n[3]->left = NULL;
                                              n[3]->right = NULL;
                                              n[3]->height = 1;
                                              return result;
                                            }


Fig. 4: Building AVLs with size at most 4. Left: all feasible assignments to node’s 
fields. Right: only assignments deemed feasible by tight field bounds 


parts of the translation that are important for the understanding of our approach, 
and refer the reader to the literature for additional details [14, 13]. 


Let f be a field of T with type T'. Let $A = a_1, \ldots, a_k$ and $B = b_1, \ldots, b_k$ be
the identifiers for data objects of type T and T' within scope k, respectively. This
bounded field is then a relation $f \subseteq A \times (B + \mathit{null})$. The propositional encoding
of f consists of boolean variables $f_{i,j}$, $0 < i, j \le k$, such that $f_{i,j} = \mathit{True}$
in an instance $I$ if and only if the value of f for object $a_i$ is equal to object
$b_j$ (i.e., $a_i$->f = $b_j$) in $I$ (the original translation also has variables representing
$a_i$->f = null; we omit these here to simplify the presentation).


As an example, Figure 5 below shows the propositional variables representing
all the feasible values of binary trees' left and right fields for scope 6, in tabular
form. In the tables, object identifiers are named $N_i$ ($0 \le i < 6$), and variables $l_{i,j}$
($0 \le i, j < 6$) denote $N_i$->left = $N_j$ (similarly, $r_{i,j}$ denote $N_i$->right = $N_j$).


left | N_0      N_1      ...  N_5          right | N_0      N_1      ...  N_5
N_0  | l_{0,0}  l_{0,1}  ...  l_{0,5}      N_0   | r_{0,0}  r_{0,1}  ...  r_{0,5}
N_1  | l_{1,0}  l_{1,1}  ...  l_{1,5}      N_1   | r_{1,0}  r_{1,1}  ...  r_{1,5}
...  | ...                                 ...   | ...
N_5  | l_{5,0}  l_{5,1}  ...  l_{5,5}      N_5   | r_{5,0}  r_{5,1}  ...  r_{5,5}


Fig. 5: Propositional encodings of binary trees’ left and right fields for a scope 
of 6 


In this way, the binary tree at the left of Figure 3, whose relational
representation is given in Equation (1), is defined exactly by setting the following variables
to true (and all the remaining variables to false):


$\mathit{left} = \{l_{0,1}, l_{1,3}\}, \quad \mathit{right} = \{r_{0,2}, r_{1,4}, r_{2,5}\}$ (2)


As each propositional variable in the encoding of a field represents exactly 
the fact that a single pair of objects belongs to the field, in the following we 
will speak of these two notions (propositional variables and pairs of objects re- 
lated by a field) interchangeably. In fact, as our approach operates with propo- 
sitional formulas (needed for exploiting incremental SAT solving), the tight field 
bounds will be represented and computed in terms of propositional variables. It 
is straightforward to see that if variable $f_{i,j}$ belongs to the tight field bound for
field f, then $(a_i, b_j)$ is a feasible pair in the relational semantics (and is infeasible
if $f_{i,j}$ does not belong to the tight field bound).
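
The correspondence between pairs and propositional variables can be sketched with an explicit numbering. The 1-based, DIMACS-style numbering used by var_of below is an assumption chosen for illustration; the numbering actually produced by the TACO translation is not described here. The sketch builds the valuation for the tree of Equation (1), which sets exactly the five variables of Equation (2) to true.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define K 6                                  /* scope of Figure 5 */
enum field { LEFT = 0, RIGHT = 1, NUM_FIELDS = 2 };
#define NUM_VARS (NUM_FIELDS * K * K)

/* Hypothetical 1-based number of the variable for field f and pair (i, j). */
static int var_of(enum field f, int i, int j) {
    assert(0 <= i && i < K && 0 <= j && j < K);
    return 1 + ((int)f * K + i) * K + j;
}

/* Fills val[1..NUM_VARS] with the valuation of the tree of Equation (1)
   and returns the number of variables set to true. */
static int encode_fig3_left(bool val[NUM_VARS + 1]) {
    static const int left[][2]  = { {0, 1}, {1, 3} };
    static const int right[][2] = { {0, 2}, {1, 4}, {2, 5} };
    const size_t nl = sizeof left / sizeof left[0];
    const size_t nr = sizeof right / sizeof right[0];
    memset(val, 0, (NUM_VARS + 1) * sizeof(bool));
    for (size_t p = 0; p < nl; p++)
        val[var_of(LEFT, left[p][0], left[p][1])] = true;
    for (size_t p = 0; p < nr; p++)
        val[var_of(RIGHT, right[p][0], right[p][1])] = true;
    return (int)(nl + nr);
}
```

With this numbering the two tables of Figure 5 occupy variables 1 to 36 and 37 to 72, and a tight field bound is simply a subset of these variable numbers.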

It is worth noticing that deciding if there exists a structure with a particular
field value, say $a_i$->f = $b_j$, can be accomplished by querying the solver about the
satisfiability of a formula consisting of a propositional encoding of the structure
and the invariant (prop_repok), the propositional encoding of the symmetry
breaking predicate (prop_sbpred), and the corresponding variable $f_{i,j}$:


$\mathit{prop\_repok} \land \mathit{prop\_sbpred} \land f_{i,j}$ (3)


In case the satisfiability verdict is true, the valuation returned by the solver
corresponds to a valid (in the sense that it satisfies the invariant) memory heap,
containing pair $(a_i, b_j)$ in the relational representation of f. Also, from the
valuation we can retrieve for each field f all the (true) variables that represent pairs
of objects related by f in that particular heap.

The formula above can be used to compute tight bounds, determining which
variables $f_{i,j}$ (and hence the corresponding pairs in the fields'
semantics) are infeasible in states that satisfy the invariant. In [14], the infeasible variables
are determined using a top-down algorithm. In the algorithm therein, the field
semantics is initially set, for a field of type B declared in structure A, to
$A \times (B \cup \{\mathit{null}\})$. From this fully populated initial semantics, each pair is checked for
feasibility. Pairs found to be infeasible are removed from the bound. Adopting 
this top-down approach for computing tight field bounds leads to feasibility 
checks (a large number of these) that are independent from one another, thus 
making it amenable to distributed processing. Moreover, a pair can be removed 
from the bound as soon as it is deemed infeasible, which can be exploited to 
compute tight field bounds “non-exhaustively”, e.g., dedicating a certain time 
to the computation of tight field bounds, and taking the obtained tight field 
bound for improving SAT analysis, regardless of whether the tight bound is the 
tightest (it converged to removing all infeasible pairs) or not. The latter can be 
achieved thanks to the fact that, in the top-down approach, intermediate bounds 
are also tight bounds [14,13]. As each SAT query in this top-down approach 
is independent from the rest, the algorithm does not exploit the incremental 
capabilities of modern SAT solvers. 

Let us present our approach to compute tight field bounds. As opposed to 
the technique in [14], our algorithm operates in a bottom-up fashion. In our 
presentation below, we assume a propEncoding method that takes the repok, 
a symmetry breaking predicate sbpred, and the scopes scope, and returns an 
encoding object. Its getPropositionalFormula method creates and returns a 
CNF propositional formula, encoding the repok and sbpred for the given scope. 
Also, the encoding’s getVars(f) method returns all the propositional variables 
in the encoding of field f (see Figure 5). The algorithm uses an incremental SAT 
solver, represented by a module solver, with the following routines: 


— load: receives as argument a propositional formula in CNF and loads it into 
the solver. 

— addClause: (incrementally) adds a clause to the current formula in the solver 
for future solving invocations. 

— solve: calls the SAT-solving procedure, deciding whether the formula cur- 
rently loaded in the solver is satisfiable (SAT) or not. 

— getModel: if the formula is satisfiable, it returns the valuation produced by 
the SAT-solver. The truth value of a variable v in the model can be retrieved 
by invoking getValue(v). 


The pseudocode of our algorithm is shown in Figure 6. Line 3 builds a propositional
encoding using the repok, the symmetry breaking predicate sbpred and
the scopes. The CNF propositional formula produced by the encoding object
is then loaded into the solver in line 4. Lines 5-7 initialize sets vars_f_1, ...,
vars_f_m, each containing all the propositional variables in the encoding of the
corresponding fields $f_1, \ldots, f_m$. As opposed to the top-down algorithm proposed
in [14], which initializes fields' semantics as binary relations containing all pairs,
the bottom-up algorithm starts with empty sets feasible_f_1, ..., feasible_f_m
(lines 8-10). The sets feasible_f_1, ..., feasible_f_m are used by the algorithm to store
partial bounds for the corresponding fields $f_1, \ldots, f_m$, and will be iteratively
extended with the true variables in instances returned by the SAT solver.

A crucial step in our algorithm is performed at line 12, where the current
formula loaded in the SAT solver is extended, exploiting incremental SAT
solving [16], with a progress-ensuring constraint on heap fields. Here, we add a
clause that consists of the disjunction of all the variables in the encoding of
fields that have not been previously added to the feasible_f1, ..., feasible_fm
sets. The purpose of this clause is to ensure that instances returned by
solver.solve() in line 13 contain at least one pair that does not belong to the
sets already held in feasible_f1, ..., feasible_fm. Intuitively, by adding the
clause in line 12, the call to solver.solve() in line 13 can be interpreted as
“find a valid heap instance of the data structure that can be used to extend at
least one of the current bounds in feasible_f1, ..., feasible_fm”. If such an
instance exists, it is returned by the solver.getModel() method, and stored in
the model variable in line 14. The variables that are true in model are then
added to the feasible_f1, ..., feasible_fm sets in lines 15-19. The loop
terminates when feasible_f1, ..., feasible_fm cannot be augmented any further
(lines 20-21), in which case, as we prove below, these sets hold tight field
bounds and are returned by the algorithm (line 24).

As an example, assume we are computing tight field bounds for binary trees, 
and that the invocation to solver.solve() returned the instance at the left of 
Figure 3. Then, the variables in sets left and right shown in equation 2 will 
be added to feasible_left and feasible_right, respectively, in lines 15-19. 
Notice that this forces an instance with at least one variable not in the left or 
right sets to be returned by solver.solve() in the next iteration. 
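
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function names solve and bottom_up, the field_vars dictionary, and the four-variable toy formula are all hypothetical, clauses are lists of non-zero integers (negative meaning negated), and a brute-force enumeration stands in for the incremental SAT solver.

```python
from itertools import product


def solve(clauses):
    """Brute-force SAT check; returns a model (dict var -> bool) or None."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, bits))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return model
    return None


def bottom_up(cnf, field_vars):
    """Grow per-field sets of feasible variables until no progress is possible.

    cnf        -- clauses standing in for the encoding of repok and sbpred
    field_vars -- maps each field name to the variables encoding that field
    """
    clauses = [list(clause) for clause in cnf]
    feasible = {field: set() for field in field_vars}
    while True:
        # Progress-ensuring clause: at least one variable that is not yet
        # known to be feasible must be true in the next instance.
        progress = [v for field, vs in field_vars.items()
                    for v in vs if v not in feasible[field]]
        clauses.append(progress)   # an empty clause makes the formula UNSAT
        model = solve(clauses)
        if model is None:          # UNSAT: bounds cannot be extended further
            break
        for field, vs in field_vars.items():
            feasible[field] |= {v for v in vs if model.get(v, False)}
    return feasible


# Toy formula: variable 2 is never true, and variables 1 and 3 exclude each other.
toy_cnf = [[-2], [-1, -3]]
toy_fields = {"left": [1, 2], "right": [3, 4]}
bounds = bottom_up(toy_cnf, toy_fields)
```

On this toy input, variable 2 is the only one that no satisfying valuation sets to true, so it is the only variable left out of the computed bounds.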

It is worth noting that the progress-ensuring constraint in line 12 is
encoded as a single clause. This is what makes it possible to use
incremental SAT solving [16] in our tight bounds computation. Essentially,
incremental SAT solvers allow one to append further constraints after each sat- 
isfying valuation is found, as long as these are in CNF. These constraints are 
conjoined with the main (CNF) formula, and used in computing the “next” 
satisfying instance without having to restart the solving process (which is a 
very time consuming process). Also, this allows the solver to exploit the learned 
clauses (that summarize the conflicts found by the solver in the search of satis- 
fying valuations) to help accelerate subsequent queries [10]. Notice that, if the 
new constraints were not in CNF, the whole resulting formula would have to be 
translated to CNF and the SAT process restarted from scratch. 

Theorem 1 proves our algorithm terminates and computes tight field bounds. 


Theorem 1. Algorithm 6 terminates and returns valid tight field bounds. 


Proof. Termination easily follows from the following two facts: (i) for given 
bounds on data domains of the structure under analysis and limited by scopes, 
the number of pairs that can be added to a field bound is a finite number; and 
(ii) each while-loop iteration either adds at least an extra pair to the bounds, 
or otherwise returns unsat, in which case the loop terminates. 

To prove that the algorithm yields tight field bounds, we proceed as follows. 
Notice that at each iteration, and for any field fi, the bound associated to field
fi (feasible_fi) is a subset of the corresponding tight bound, i.e., contains
only feasible variables: the initial bound (Ø) is obviously a subset of the tight 
 1 procedure bottom-up(repok, sbpred, scopes)
 2 begin
 3   encoding = propEncoding(repok, sbpred, scopes)
 4   solver.load(encoding.getPropositionalFormula())
 5   vars_f1 = encoding.getVars(f1)
 6   ...
 7   vars_fm = encoding.getVars(fm)
 8   feasible_f1 = {}
 9   ...
10   feasible_fm = {}
11   while True do
12     solver.addClause(⋁ v, for all i ∈ 1..m and v ∈ (vars_fi \ feasible_fi))
13     if solver.solve() = SAT then
14       model = solver.getModel()
15       feasible_f1 = feasible_f1 ∪
16         {v | v ∈ vars_f1 and model.getValue(v) = True}
17       ...
18       feasible_fm = feasible_fm ∪
19         {v | v ∈ vars_fm and model.getValue(v) = True}
20     else // UNSAT
21       break
22     fi
23   done
24   return feasible_f1, ..., feasible_fm
25 end


Fig. 6: Bottom-up algorithm for tight field bounds computation 


bound, and bounds are extended only by adding variables extracted from valid
structures (i.e., each loop iteration produces a valid expansion). An inductive
argument allows us to conclude that, on termination, the bound associated to
field fi (feasible_fi) is a subset of the tight bound. We will now show that
feasible_fi is the tight field bound. Let us suppose that, once the algorithm
terminates, bound feasible_fi is not tight, i.e., there exists a feasible
variable v_{w,z} that does not belong to feasible_fi. Then, there must exist a
canonical (i.e., satisfying symmetry breaking) instance I of repok within
scopes, in which o_w->fi = o_z. Therefore, I satisfies repok, sbpred, and
v_{w,z} = True, and hence also the clause added in line 12, so the last call to
solver.solve() could not have returned UNSAT, contradicting the fact that the
algorithm had terminated. Therefore, all variables excluded from feasible_fi
are infeasible, making this bound tight.


As opposed to the top-down algorithm for tight bounds introduced in [14, 13],
Algorithm 6 only provides useful information once it terminates: intermediate
bounds cannot be used to improve analysis. Moreover, whereas the top-down
approach lends itself well to parallelization (as we mentioned before, it implies
a large number of independent SAT queries that can be solved in a distributed
manner), it is not obvious how one would reasonably distribute our new
bottom-up computation. Nevertheless, as we will show in Section 5, the sequential
Algorithm 6 and its optimizations (i.e., the use of incremental SAT solving) are
substantially faster than the parallel, distributed, top-down approach.


5 Evaluation 


Our first experimental evaluation assesses the impact of tight field bounds in 
verification of code handling linked structures using CBMC. The evaluation is 
based on a benchmark of collection implementations, previously used for tight 
field bounds computation in [14,13], composed of data structures with increas- 
ingly complex invariants: 


— an implementation of sequences based on singly linked lists (LList); 

— a List implementation (from Apache Commons Collections), based on circular
doubly-linked lists (AList);

— a List implementation (from Apache Commons Collections), based on node 
caching linked lists (CList); 

— a Set implementation (from java.util) based on red-black trees (TSet); 

— an implementation of AVL trees obtained from the case study used in [4] 
(AVL); and 

— an implementation of binomial heaps used as part of a benchmark in [28] 
(BHeap). 


Experiments in this section were run on workstations with an Intel Core i7
4790 processor (8 MB cache, 3.6 GHz, 4 GHz Turbo) and 16 GB of RAM, running
GNU/Linux. The incremental SAT solver used was Minisat 2.2.0. We denote
by OOM that the 16 GB of memory were exhausted, and by OOM+ that the
16 GB were exhausted while CBMC was preprocessing; in this latter case no
numbers of clauses or variables were produced by CBMC. The timeout for
these experiments was set to 1 hour.

Table 1 reports, for the most relevant routines of each of the data structures in
our benchmark, the verification running times in seconds, with the running time
of the underlying decision procedure given in parentheses, as well as the number
of clauses and variables (expressed in thousands) in the CNF formulas corresponding
to each of the verification tasks, for several scopes (S). Since we checked whether
the routines preserved the corresponding structure’s invariant, we did not con- 
sider for the experiments those routines that did not modify the structure (these 
trivially preserve the invariant). We assessed three different approaches: 


— Build*: use of verification harnesses based on insertion routines (see Fig. 1), 

— Gen&Filter (generate and filter): non-deterministic generation of data struc- 
tures without tight field bounds (as illustrated in Fig. 4), using a traditional 
symmetry breaking algorithm to discard isomorphic structures [14] (we do 
not discuss this here due to space reasons), 

— TFB: our approach, which incorporates tight field bounds into the
previous one to discard irrelevant non-deterministic assignments of field values
(as illustrated in Fig. 4).
Some remarks on the results are in order. Table 1 shows that in all analyzed
routines, the TFB approach allowed us to analyze larger scopes for which the 
other input generation techniques exhausted the allotted time or memory. TFB 
was able to analyze larger scopes than Gen&Filter in 7 out of 12 cases (remark- 
ably, by at least 6 in AList, at least 3 in CList and at least 2 in AVL), and in 
8 out of 12 cases with respect to Build* (by at least 4 in all 8 cases). Routine 
extractMin in structure BHeap is particularly interesting: it contains a bug first 
found in [14] that can only be exhibited by an input with at least 13 nodes. 
Gray cells mark experiments in which the bug was detected by CBMC. Notice 
in particular that Build* does not scale well enough to find this bug. 

Our second evaluation is devoted to tight field bounds computation, in com- 
parison with the top-down approach presented in [14]. We re-ran the TACO 
experiments as reported in [13] on the same hardware we used for our own 
experiments for a fair comparison. Original scripts and configurations were pre- 
served. All distributed experiments were run on a cluster of 9 PCs (one being the 
master) of the same characteristics as described above. Each distributed exper- 
iment was run 3 times; the reported timing is the average thereof. All times are 
given in wall-clock seconds. A timeout (TO) is set at 18,000 seconds (5 hours), 
for tight bounds computation. Our bottom-up tight field bounds technique is
non-parallel, and was run on a single workstation. Table 2 summarizes the
results of our experiments regarding tight bounds computation. We compared the
running times of computing tight field bounds using the distributed technique
from [14] and our non-parallel algorithm, for scopes 10, 12, 15, 17 and
20, reporting the following:


— TACO(||): The parallel wall-clock time required to compute tight field bounds 
with TACO, the tool subsuming the top-down tight bounds approach [14, 13]. 

— TACO(s): The TACO sequentialized time, i.e., the sum of times over all the 
Minisat solvings performed by the TACO distributed algorithm. 

— BU: The time the bottom-up algorithm (Alg. 6) requires to compute tight 
field bounds. 

— speedup(||): The speed-up achieved by BU when compared to the distributed 
TACO time reported as TACO(||). 

— speedup(s): The speed-up achieved by BU when compared to the
sequentialized TACO time reported as TACO(s).
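
As a worked illustration of the two speed-up columns, the following Python sketch divides a TACO time by the corresponding BU time. The function name speedup and the dictionary layout are hypothetical; the timings are the BHeap (scopes 17 and 20) and AVL (scope 20) entries of Table 2. The computed ratios come out close to, but not always identical with, the reported figures, presumably because the published times are rounded.

```python
def speedup(taco_time, bu_time):
    """Speed-up of BU over TACO: how many times faster BU finished."""
    return taco_time / bu_time


# Times in wall-clock seconds, taken from Table 2.
bheap = {
    "S17": {"taco_par": 1119.7, "taco_seq": 35409.9, "bu": 80.7},
    "S20": {"taco_par": 3224.0, "taco_seq": 102496.0, "bu": 243.9},
}

for scope, t in bheap.items():
    par = speedup(t["taco_par"], t["bu"])   # compare with speedup(||)
    seq = speedup(t["taco_seq"], t["bu"])   # compare with speedup(s)
    print(f"BHeap {scope}: speedup(||) = {par:.1f}X, speedup(s) = {seq:.1f}X")

# AVL with scope 20 is the one case where BU is slower than parallel TACO,
# so the ratio drops below 1.
avl_s20 = speedup(5939.5, 8562.2)
```

A ratio below 1, as for AVL with scope 20, means that the sequential bottom-up run took longer than the distributed top-down run.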


The speed-ups obtained by Alg. 6 are, in comparison with the distributed 
approach in [14], in general very good. In particular, in all experiments but AVL 
with scope 20, the running time of our sequential bottom-up approach (BU) is 
already below the wall-clock time of (parallel) TACO. For AVL trees with scope 
20, the only experiment where BU performed slower than TACO, the achieved
speed-up is 0.6X. This means that running BU on a single workstation does not
even take twice as long as running TACO(||) on 32 processors (4 cores in 8 slave 
machines used for distributed computation). Second, it is worth noticing that 
structures with strong invariants (e.g., BHeap) intuitively lead to “small” tight 
field bounds; a bottom-up approach then, as we explained earlier, is particularly 
well suited for tight bounds computation for these structures, since the process 




Table 1: Dynamic data structure verification in CBMC: TFB versus Build* and
Gen&Filter. Verification and solving times in seconds, clauses and variables in
thousands


                   |           Build*           |         Gen&Filter         |            TFB
Struct Routine   S | Time(Solv) Clauses    Vars | Time(Solv) Clauses    Vars | Time(Solv) Clauses    Vars
LList  insBack  18 |      10(5)     705   2,236 |      11(5)     248   1,157 |      10(4)     188     916
                19 |      12(6)     797   2,524 |      13(6)     275   1,288 |      11(4)     206   1,015
                20 |      14(7)     898   2,836 |      16(7)     303   1,428 |      13(5)     226   1,122
       remove   18 |      10(6)     629   2,004 |      14(9)     247   1,154 |      11(6)     201     967
                19 |      13(8)     715   2,274 |     23(16)     275   1,288 |      13(7)     221   1,075
                20 |      14(9)     809   2,567 |     20(12)     303   1,431 |      15(8)     243   1,190
AList  addLast  13 |       2(1)     146     628 |       9(7)     235     947 |       3(2)     184     738
                14 |       2(1)     164     704 |         TO     267   1,082 |       3(2)     206     827
                20 |       6(4)     292   1,285 |          -       -       - |       8(6)     357   1,459
       remind   14 |       5(3)     255   1,099 | 1168(1166)     352   1,444 |       8(6)     307   1,270
                15 |       6(5)     287   1,238 |         TO     400   1,645 |      10(8)     346   1,431
                20 |     17(14)     471   2,058 |          -       -       - |     27(24)     568   2,387
CList  addLast   6 |   407(402)   2,471   9,937 |       2(1)     109     430 |       1(1)     103     402
                 7 |         TO   3,754  15,158 |       2(1)     133     527 |       2(1)     122     482
                17 |          -       -       - | 1423(1419)     527   2,188 |      10(7)     411   1,692
                18 |          -       -       - |         TO     583   2,425 |      10(7)     449   1,853
                20 |          -       -       - |          -       -       - |     14(10)     530   2,195
       remove    6 |   490(486)   1,750   6,994 |       4(3)     258   1,002 |       4(3)     247     958
                 7 |         TO   2,755  11,066 |       8(3)     356   1,395 |       5(4)     332   1,298
                15 |          -       -       - | 2820(2812)   2,151   8,642 |     60(52)   1,768   7,103
                16 |          -       -       - |         TO   2,537  10,202 |    102(93)   2,067   8,315
                20 |          -       -       - |          -       -       - |   219(201)   3,578  14,454
AVL    insert    1 |   114(105)  13,724  58,613 |     19(17)   2,232   9,593 |       7(5)     712   3,006
                 2 |       OOM+       -       - |     69(62)   7,011  30,138 |     21(15)   1,665   7,125
                 3 |          -       -       - |   203(182)  19,230  82,602 |     57(38)   3,414  14,745
                 4 |          -       -       - |       OOM+       -       - |   169(114)   7,005  30,500
                 5 |          -       -       - |          -       -       - |   411(266)  13,818  60,663
                 6 |          -       -       - |          -       -       - |        OOM  26,981 119,475
       delete    2 |     94(87)  10,874  46,271 |     11(10)   1,227   5,257 |       4(3)     421   1,777
                 3 |       OOM+       -       - |     39(34)   3,844  16,491 |      11(8)     823   3,487
                 4 |          -       -       - |   118(105)  10,522  45,120 |     40(29)   2,351  10,058
                 5 |          -       -       - |        OOM  26,768 114,697 |    108(81)   3,823  16,387
                 7 |          -       -       - |          -       -       - | 1171(1011)  14,365  62,154
                 8 |          -       -       - |          -       -       - |       OOM+       -       -
TSet   add       1 |   158(104)  11,849  43,627 |     20(16)   2,093   8,076 |      10(7)     980   3,811
                 2 |       OOM+       -       - |     63(62)   5,609  21,908 |     37(27)   3,043  11,961
                 4 |          -       -       - |   362(314)  23,538  93,122 |   206(160)  12,042  47,960
                 5 |          -       -       - |       OOM+       -       - |   386(305)  18,270  73,206
                 6 |          -       -       - |          -       -       - |       OOM+       -       -
       remove    2 |    128(85)   9,708  34,998 |      10(7)   1,029   3,934 |       9(7)   1,000   3,825
                 3 |        OOM  28,268 107,255 |     23(19)   2,074   8,039 |     22(17)   2,016   7,818
                 9 |          -       -       - |   828(761)  22,881  91,143 |   760(699)  22,698  90,434
                10 |          -       -       - |        OOM  29,724 118,620 |        OOM  29,548 117,943
BHeap  insert    5 |   188(181)  17,620  75,858 |     35(32)   2,722  11,679 |     32(28)   2,627  11,297
                 6 |        OOM  27,217 117,396 |     56(50)   3,852  16,522 |     47(42)   3,717  15,972
                13 |          -       -       - |   640(603)  18,480  78,795 |   523(487)  17,827  76,142
                14 |          -       -       - |        OOM  21,645  92,214 |        OOM  20,967  89,456
       extrMin   5 |   157(152)  14,713  63,329 |     26(23)   2,015   8,603 |     24(21)   1,930   8,267
                 6 |        OOM  23,511 101,429 |     44(39)   3,022  12,905 |     40(35)   2,914  12,474
                12 |          -       -       - |   487(459)  14,254  60,615 |   441(414)  13,921  59,285
                13 |          -       -       - |        576  17,094  72,634 |        535  16,711  71,102


of computing bounds by discovering and adding new elements to a partial bound 
until nothing new can be discovered, quickly converges to termination in these 
Table 2: Tight field bounds computation times and achieved speed-ups.


LList             S10       S12       S15       S17       S20
TACO(||)          7.5      10.7      29.0      42.6      66.1
TACO(s)         122.0     231.8     777.4    1204.7    1932.5
BU                1.3       2.0       3.4       5.3      11.5
speedup(||)      5.7X      5.3X      8.3X      8.0X      5.7X
speedup(s)      91.7X    114.1X    224.6X    227.7X    166.8X

AList             S10       S12       S15       S17       S20
TACO(||)         15.9      29.8      73.0     120.3    2174.8
TACO(s)         381.4     807.9    2153.8    3638.0   67936.0
speedup(||)      7.6X     11.8X     14.7X     13.9X    134.9X
speedup(s)     184.2X    319.3X    435.9X    423.0X   4217.0X

CList             S10       S12       S15       S17       S20
TACO(||)         35.6      64.2     110.7     176.3    4634.6
TACO(s)         981.1    1881.9    3331.1    5386.0  145106.0
speedup(||)     14.6X     14.0X      9.2X      3.2X      1.6X
speedup(s)     402.0X    410.8X    276.8X     98.7X     51.2X

AVL               S10       S12       S15       S17       S20
TACO(||)         64.6     141.9     465.9    2437.7    5939.5
TACO(s)        1893.7    4323.3   14645.6   77536.6  187161.0
BU                8.1      23.0     111.4    1078.0    8562.2
speedup(||)      7.8X      6.1X      4.1X      2.2X      0.6X
speedup(s)     231.2X    187.3X    131.4X     71.9X     21.8X

TSet              S10       S12       S15       S17       S20
TACO(||)         76.0     145.6     258.2     872.8    2335.4
TACO(s)        2434.9    4411.4    8005.7   27538.8   74134.6
BU                4.8      10.3      39.1     168.6     527.6
speedup(||)     15.6X     14.0X      6.5X      5.1X      4.4X
speedup(s)     458.9X    425.4X    204.4X    163.2X    140.4X

BHeap             S10       S12       S15       S17       S20
TACO(||)        115.9     188.3     345.0    1119.7    3224.0
TACO(s)        3505.6    5747.1   10759.1   35409.9  102496.0
BU                4.4       9.1      23.8      80.7     243.9
speedup(||)     26.0X     20.4X     14.4X     13.8X     13.2X
speedup(s)     786.0X    625.3X    452.0X    438.6X    420.1X


cases. Third, some structures with relatively weak invariants also had good run- 
ning times (AList, in particular), when compared to other case studies. Although 
the invariants in these cases are weaker, which intuitively would lead to more 
expensive tight bounds computations, these structures have fewer fields, so the 
state space to be covered to compute tight bounds is significantly smaller than 
that of more complex structures. 

All the experiments in this section can be reproduced following the instruc- 
tions available at [1]. 
Threats to Validity. Our experimental evaluation is limited to data structures.
From the vast domain of data structures, we have selected a few that we
consider representative for several reasons: they are often used as case studies in
the evaluation of other software analysis tools [6, 9, 18, 28]; their invariants have
varied complexity (a dimension that affects the size of tight bounds, and thus
their computation); and some are acyclic while others are not (which shows that
the encoding we make in CBMC is quite general). We consider this a good
selection, representative of a wider class of data structures.

Our approach to capture both the Build* and Gen&Filter strategies might 
have accidentally favored our technique. We tried different alternatives for cap- 
turing Build* and Gen&Filter, in particular with different ways of writing the 
repOK routines (which affected running times). We took the best alternative 
found for each case, to perform the comparison. In the case of Build*, we took 
the smallest number of builder routines that guaranteed producing all (bounded) 
structures, since this is a factor that impacts running times. All structures with 
the exception of CList and TSet required just the add routine, while these two 
also needed a remove routine, to guarantee generation of all structures. 

Regarding variance across cluster runs, different schedulings indeed yield 
slightly different timings. Since the granularity of individual analyses is fine, 
differences are typically small. However, they grow with the scope (e.g., usually 
smaller than 5% for scope sizes below 10, but up to 15% for the largest sizes). 
We used the average of 3 runs to reduce the effect of variance in the experiments. 

Finally, we did not prove our implementations correct, so our results may 
be affected by errors in our implementations. We checked consistency of the re- 
sults across different techniques and tools to confirm that bounds were correctly 
computed, and verification was bounded complete in all cases. 


6 Related Work 


Automated analysis of code handling dynamic data structures has been the fo- 
cus of various lines of research, including separation logic based approaches [5], 
approaches based on combinations of testing and static analysis [22], various 
forms of model checking including explicit state model checking [27], symbolic 
execution based model checking [23] and SAT-based verification [14,13]. The 
approach that we refer to as Build*, producing nondeterministic structures by 
using insertion routines, has been used in some of these approaches, including 
[22,11]. The “generate & filter” mechanism, on the other hand, is more often 
employed in modular (assume-guarantee) verification. In particular, the lazy ini- 
tialization approach, whose symmetry breaking we borrowed for “generate & fil- 
ter” in this paper is used in [19], among others. However, in SAT-based bounded 
model checking, with tools such as [20], “generate & filter” is not reported as 
an analysis option for dynamic data structures. Tight bounds have been used
previously to improve test generation and bounded
verification for JML-annotated Java programs [14,13]. The setting is, however,
different from that of CBMC, due to the relational program (and heap state)
semantics, which enabled them to exploit tight bounds directly at the propositional
encoding level. Tight bounds have also been used for improving symbolic
execution based model checking [15, 26]. Again, the context is different, since these
approaches, which essentially “walk” the code (either concretely or symbolically),
can exploit tight bounds more deeply [26], also obtaining greater benefits.


We have also reported a novel technique to compute tight bounds. This
algorithm is inspired by the work of [24] on black-box test input generation
using SAT. Our work is also closely related to [14,13]. The approach to compute
tight field bounds presented in [14,13] as part of the TACO tool performs
a very large number of independent SAT queries to compute bounds, and thus 
requires a cluster of workstations to do so effectively (we compared with this 
approach in the paper). Another alternative approach to compute tight field 
bounds is presented in [25], but requires structure specifications to be provided 
in a Separation Logic flavor [21] to compute field bounds. 


7 Conclusions 


We have investigated the use of tight field bounds in the context of SAT-based 
bounded model checking, more concretely, in (assume-guarantee) verification of 
C code, using CBMC. We showed that, in this context, and in particular in 
the verification of programs dealing with linked structures, an approach based 
on nondeterministically generating structures, and then “filtering out” ill-formed 
ones, can be more efficient than the more traditional approach of repeatedly using 
data structure builders, especially when tight bounds are exploited. We have 
performed a number of experiments that confirm that this alternative approach 
allows CBMC to consider larger input sizes as well as to detect bugs that could 
not be detected without using bounds. 


Since the approach depends on precomputing tight field bounds, we have also 
studied this problem, providing a novel algorithm for tight field bound compu- 
tation. Tight field bounds have proved useful for a number of different analyses, 
but computing them is costly, and previous field bound computation approaches 
that performed reasonably did so at the expense of relying on a cluster of work- 
stations to perform the task, or were only applicable to a limited set of class 
invariants, expressible in separation logic. Thus, while tight field bounds proved 
to have a deep impact in the previously mentioned automated software analysis 
techniques, their use has been severely undermined by the necessity of a cluster 
of computers for their effective computation, or the availability of specifications 
in separation logic. The algorithm presented in this article allows one to compute
tight field bounds on a single workstation more efficiently than the distributed
approach on a cluster of 8 quad-core machines, and therefore makes tight field
bounds computation both practical and worthwhile as part of the above-mentioned
analyses.
Abstract. Static analysis frameworks, such as Soot and Wala, are used
by researchers to prototype and compare program analyses. These frame- 
works vary on heap abstraction, modeling library classes, and underlying 
intermediate program representation (IR). Often, these variations pose 
a threat to the validity of the results as the implications of comparing 
the same analysis implementation in different frameworks are still un- 
explored. Earlier studies have focused on the precision, soundness, and 
recall of the algorithms implemented in these frameworks; however, little 
to no work has been done to evaluate the effects of program represen- 
tation. In this work, we fill this gap and study the impact of program 
representation on pointer analysis. Unfortunately, existing metrics are 
insufficient for such a comparison due to their inability to isolate each 
aspect of the program representation. Therefore, we define two novel 
metrics that measure these analyses’ precision after isolating the influ- 
ence of class-hierarchy and intermediate representation. Our results es- 
tablish that the minor differences in the class hierarchy and IR do not 
impact program analysis significantly. Besides, they reveal the sources of 
unsoundness that aid researchers in developing program analysis. 


Keywords: Pointer Analysis, Java, Program Analysis, Empirical Studies 


1 Introduction 


Researchers have proposed various approaches to enhance the precision and 
soundness of static analyses [6,9, 10,14, 17, 26,30,31]. They use program analy- 
sis frameworks to prototype and evaluate their algorithms. A program analysis 
based on declarative specifications (an increasingly popular implementation
paradigm) uses these frameworks to extract fundamental dataflow relations and feeds
them as the ground facts to a Datalog engine. 

Program analysis frameworks, primarily Soot and Wala, are being increas- 
ingly adopted in program analysis [11,31,40]. These frameworks provide APIs, 
which abstract internal program representation. However, program representa- 
tion in these frameworks is heterogeneous in many aspects. A few of those are: 
— Intermediate Representation (IR). The intermediate language for program
representation is an abstraction of the object code (bytecode) or source code. 
It removes syntactic sugar from the language and transforms it into a (mini- 
mal) core language. Thus, analysis developers can focus on the core language 
features to define their analysis. 

— Modeling of libraries in analysis scope. Real-life applications are seldom
developed from scratch; instead, they reuse library modules. Whole-program 
analyses consider these libraries for soundness in terms of the class-hierarchy, 
which forms the analyses’ scope. Users can tune the scope to favor scalability 
over soundness. 

— Heap Modeling. Heap modeling is the technique to model dynamic heap al- 
location statically. Precise heap modeling is undecidable; therefore, analyses 
use approximations to keep it decidable [20]. Apart from these approxima- 
tions, optimization may choose to keep a low memory footprint at the cost 
of precision and soundness. 

These factors influence the precision, scalability, and soundness of the analyses
and, at the same time, impede a fair comparison of analyses. Earlier research
(Spath et al. [29]) was concerned about the validity of results when comparing
two analysis frameworks. Reif et al. consider the comparison of different frameworks
“bogus” [21] at the outset. Although many earlier works have proposed
techniques to enhance scalability and precision, little to no work was done on 
how program representation influences program analyses. As a result, a com- 
parison of new analyses with existing analyses suffers from a threat to validity 
that might have been overlooked. In this work, we fill the gap with an empirical 
study of these aspects of program analysis frameworks. 

We choose pointer analysis for this study. Pointer analysis computes the heap
locations referred to by program variables and builds the foundation for many
other analyses, such as alias analysis, type-state analysis, or program slicing.
To evaluate intermediate representation and library modeling, we choose Doop, a
state-of-the-art pointer analysis framework, and compare its analysis for
different frontends. For the third aspect, heap modeling, we compare the pointer
analysis of Wala (a state-of-the-art program analysis framework) with Doop using
Wala’s frontend, i.e., leveraging the identical intermediate representation.

A challenging aspect of this work is that the existing notions of precision 
for pointer analysis were insufficient. The computation of these metrics does 
not isolate single aspects of pointer analysis but rather combines all effects. 
For example, the average points-to set size is influenced by all three of the 
aforementioned aspects. It is difficult to determine the effect of each aspect by 
only looking at the score. In this work, we counteract this problem by introducing
metrics that isolate a particular aspect under study and nullify the effect of the
others. To this end, we define two novel metrics in Section 3.1, one for measuring
the effects of libraries and one for the IR, to enable a fair comparison among
frameworks. To the
best of our knowledge, it is the first study that evaluates the impact of program 
representation on pointer analysis. Precisely, in this paper, we make the following 
contributions: 
— We defined two metrics for evaluating each aspect in isolation, one for modeling
of library classes, the other for IR.

— We evaluated the differences in library modeling and found that these have 
little influence on program analyses. Additionally, we discovered sources of 
unsoundness in these frameworks. 

— We evaluated the precision for different IRs and found that they have no 
impact on the precision of virtual method call elimination. 

— We empirically found differences in heap abstractions even for analyses claim- 
ing the same levels of context-sensitivity regarding the types of heap objects. 

In summary, our empirical study dispels the threats to the validity of the
results of existing works posed by these differences between frameworks. It also
uncovers novel sources of unsoundness and imprecision in existing frameworks,
which yield suggestions that users and developers of these frameworks could
incorporate into their analyses. Although we focus on pointer analysis in the
paper, our results are, in principle, generalizable to many other static analyses,
as the findings presented in this paper also hold for these. We have made the
artifacts available on https://github.com/jpksh90/pointeval to facilitate
reproduction.


2 Background and Motivation 


The goal of pointer analysis is to determine which objects a variable may refer
(point) to at runtime. A points-to set is a static approximation of the answer to
this question: it maps variables to objects that are allocated on the heap (heap
objects). More precisely, if V is the set of variables in a program, and H is the
set of heap objects, then points-to : V → P(H), where P(H) denotes the power set
of H, and points-to(v) returns the set of heap objects in H that v may refer to.
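
As a small illustration of this definition, the following Python sketch represents points-to as a dictionary from variable names to sets of abstract heap objects. All names (the variables a, b, c, the heap objects o1 and o2, and the two helper functions) are hypothetical; the helpers merely show two typical uses of such a map, a may-alias query and the average points-to set size.

```python
from typing import Dict, FrozenSet

# points-to : V -> P(H), with V = {a, b, c} and H = {o1, o2}
PointsTo = Dict[str, FrozenSet[str]]

points_to: PointsTo = {
    "a": frozenset({"o1"}),
    "b": frozenset({"o1", "o2"}),
    "c": frozenset(),          # c is never assigned a heap object
}


def may_alias(pt: PointsTo, v1: str, v2: str) -> bool:
    """Two variables may alias if their points-to sets share a heap object."""
    return bool(pt[v1] & pt[v2])


def average_points_to_size(pt: PointsTo) -> float:
    """Average number of heap objects per variable (smaller is more precise)."""
    return sum(len(objs) for objs in pt.values()) / len(pt)
```

Because the sets are over-approximations, may_alias answering False is a definite result under a sound analysis, whereas True only signals a possibility.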

Doop is a framework that exclusively focuses on pointer analysis, defines 
the analysis’ inference rules in Datalog [41], and is in active development. It 
supports tuning of the analysis to adapt for various factors of precision (and 
scalability). Doop leverages the program synthesizer Soufflé [12,22] to resolve 
points-to according to the inference rules and the ground facts, which are derived 
directly from the program. 

Wala [37] and Soot [28] are general-purpose program analyzers providing 
some pre-defined analyses and APIs for the development of custom analyses. 
Wala comes with various pre-defined pointer analyses [39], some of which feature 
novel optimizations to enhance scalability. 

A context-sensitive analysis improves a pointer analysis’ precision by discern- 
ing method calls based on their calling contexts. Popular notions of contexts 
are based on method callsites [23] (callsite-sensitive), invoking objects (object- 
sensitive) [19], or hybrids thereof [13]. 

In the sequel, we explain the need for this study by exemplifying the three
factors that influence the results of pointer analyses.

Listing 1.1: Factory Method

1 public class Factory {
2   public static void main(String args[]) {
3     AInt a = AInt.getInstance(5);
4     AInt b = AInt.getInstance(7); } }
5 class AInt {
6   private Integer a; //... getter, setter and constructor
7   public static AInt getInstance(int x) {
8     return new AInt(x); //allocation a@8
9   } }


Listing 1.2: Soot IR for the main method in Listing 1.1

1 public class Factory extends java.lang.Object {
2   //constructor
3   public static void main(java.lang.String[]) {
4     java.lang.String[] r0;
5     AInt r1, r2;
6     r0 := @parameter0: java.lang.String[];
7     r1 = staticinvoke <AInt: AInt getInstance(int)>(5);
8     r2 = staticinvoke <AInt: AInt getInstance(int)>(7);
9     return; } }


2.1 Intermediate Representation 


Many program analysis tools leverage an intermediate representation (IR)
instead of the actual source or bytecode for analysis. IRs remove syntactic sugar
from the source code and make it amenable to analysis by focussing on the
fundamental operations. Popular strategies for IR generation are based on
three-address code or static single assignment (SSA) form [4]. By default, the Soot
framework uses a three-address-based IR (Jimple) [35], while Wala uses an
SSA-based IR [38]. Both IRs are register-based [36,38], and hence introduce synthetic
variables to mimic the stack-based Java bytecode. Doop can be configured to
leverage either Jimple or Wala’s IR as a frontend for program representation.

Consider the code example in Listing 1.1 and its Jimple IR depicted in List- 
ing 1.2. The main method declaration (line 2) translates to the almost identical 
line 3 in the IR, whose parameter is translated to the variable @parameter0 
(line 6). Due to the additional local variable r0 (line 4), the single main method 
argument translates to two variables in the IR. The invocations of the static 
method getInstance (lines 3 and 4 of Listing 1.1) are translated to the corre- 
sponding operation code staticinvoke with the method name and arguments. 
The newly allocated objects returned from these factory method invocations are 
stored in the variables r1 and r2. 

Wala’s IR generation differs significantly from Soot’s (see Listing 1.3). As an
SSA-based IR, it does not assign names to method parameters and variables
but ordinal numbers (starting from ‘1’) called variable numbers (we prepend
‘v’ to these numbers for clarity). Thus, the receiver object (the this reference in
Java), or the first parameter in the case of a static method, is (silently) assigned
the number v1. Further method parameters are assigned subsequent variable
numbers, succeeded by local variables. Again, the static method calls to the
method getInstance are translated to invokestatic, where v3 and v6 hold the
(implicitly defined) constant arguments 5 and 7. The objects returned from the
factory method invocations are stored in the variables v5 and v8. Potential
exceptions thrown in the invoked methods are stored in v4 or v7, respectively.

Listing 1.3: Wala IR for the main method in Listing 1.1

1 Factory.main([Ljava/lang/String;)V
2 5 = invokestatic < Application, LAInt, getInstance(I)LAInt; > 3 @1 exception:4
3 8 = invokestatic < Application, LAInt, getInstance(I)LAInt; > 6 @7 exception:7
4 return


Listing 1.4: Snapshot of pointer analysis results from Doop with different IR

1 // Variables in main method with ****Wala****
2 < <<main method array>> <Factory: void main(java.lang.String[])>/v1
3 // Variables in main method with ****Soot****
4 > <<main method array>> <Factory: void main(java.lang.String[])>/@parameter0
5 > <<main method array>> <Factory: void main(java.lang.String[])>/10#_0

The differences in program representation influence the metrics of pointer
analysis. We analyzed Listing 1.1 context-insensitively with Doop, using Jimple
and Wala’s IR. The results are shown in Listing 1.4: the main method parameter
object <<main method array>> is referred to by one variable in Wala (line 2) but
by two variables in Soot (lines 4-5). Even though the average points-to set size is
1 for all variables in Listing 1.4, we found noticeable differences in the average
points-to set sizes in other programs’ analyses: with Soot’s frontend, the average
points-to set size is 2.07 for 3328 variables, whereas with Wala’s frontend it is
1.95 for 2298 variables. Jimple again created more variables than Wala. These subtle
differences in program representation affect the average points-to set size, and
it is unclear whether these two numbers are in fact comparable. In this work,
we investigate the impact of IRs on the precision and scalability of the
analysis (Section 4.3).


2.2 Static Modeling of Libraries


As a whole-program analysis, a pointer analysis requires knowledge not only
of the program to be analyzed but also of the library classes, especially those
related to the runtime. For example, a whole-program analysis of a Java application
would require the runtime libraries, such as those in rt.jar, and other dependent
libraries bundled with the application. Analysis frameworks such as Soot and
Wala construct the class hierarchy based on all classes present in the libraries and
the application. They can also remove “irrelevant” classes, favoring scalability over
soundness. Interestingly, we found cases where some frontends do not load all of
the required classes, which induces discrepancies when comparing the analyses.

Consider the program shown in Listing 1.1. To corroborate our intuition,
we analyzed this program context-insensitively with Soot’s and Wala’s
frontends. Using the former frontend, Doop loads 3,837 classes and computes the
analysis with an average points-to set size of 2.07. With Wala’s frontend, it
loads 19,927 (~5x) classes for analysis with an average points-to set size of
1.95. Further investigating the types of heap objects, we found that Doop with
Wala’s IR contains objects of the class java.security.PrivilegedActionException,
which is absent in the analysis with Soot. Note that our simple program contains
no instance of that type, so it must stem from analyzing libraries. In another
instance, Soot loads the classes from javax.crypto, whereas Wala does not. In this
research, we examine the imprecise modeling and discover possible implications
on precision and soundness (Sections 4.1 and 4.2).


2.3 Heap Abstraction 


Heap abstraction is an important aspect of pointer analysis and determines how
object allocations are statically represented in the analysis. One simple approach
is to create a unique representation for each object allocation site in the
program (allocation site abstraction). However, at runtime allocation sites can be
executed more than once, creating several objects that are then represented by
the same abstract value. As an example, consider the object allocation (line 8)
of Listing 1.1, represented via a single abstract object, say a@8. In the main
method, the newly allocated objects returned by getInstance are captured by the
variables a and b, which would both refer to the abstract object a@8 in the
result of the pointer analysis. Thus, a and b are spuriously considered aliases
(i.e., referring to the same object). This imprecision stems from ignoring the
calling context of getInstance (context-insensitive heap abstraction).

A context-sensitive heap abstraction (a.k.a. heap cloning) discerns the
abstract heap objects based on the calling context, associating the calling context
with the heap object to distinguish the allocations in a pair (allocation site,
call stack). Thus, the allocation at line 8 is represented as two heap objects,
(a@8, 3) and (a@8, 4). Without loss of generality, the length of the call stack
can be increased to any finite number, lest the analysis be undecidable. All
state-of-the-art pointer analysis frameworks offer context-sensitive heap
abstraction with a finite context length.
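The contrast between the two abstractions can be sketched in a few lines of Java. The sketch below is illustrative only and is not taken from any of the studied frameworks: the class HeapAbstractionSketch, its nested AbstractObject type, and the helper names are hypothetical, and the alias check assumes, as a simplification, that each variable refers to exactly one abstract object. With a context length of k = 0, both calls of Listing 1.1 yield the same abstract object a@8; with k = 1, they yield (a@8, [3]) and (a@8, [4]).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Illustrative sketch of the two heap abstractions, applied to Listing 1.1. */
final class HeapAbstractionSketch {

    /** An abstract heap object: an allocation site plus a (possibly empty) calling context. */
    static final class AbstractObject {
        final String allocationSite;  // e.g. "a@8"
        final List<Integer> context;  // call-site line numbers, innermost call first

        AbstractObject(String allocationSite, List<Integer> context) {
            this.allocationSite = allocationSite;
            this.context = context;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof AbstractObject)) {
                return false;
            }
            AbstractObject that = (AbstractObject) other;
            return allocationSite.equals(that.allocationSite) && context.equals(that.context);
        }

        @Override
        public int hashCode() {
            return Objects.hash(allocationSite, context);
        }

        @Override
        public String toString() {
            return "(" + allocationSite + ", " + context + ")";
        }
    }

    /** Abstracts an allocation, keeping only the k (k >= 0) innermost call sites. */
    static AbstractObject abstractAllocation(String site, List<Integer> callStack, int k) {
        int kept = Math.min(k, callStack.size());
        return new AbstractObject(site, new ArrayList<>(callStack.subList(0, kept)));
    }

    /** Simplification: each variable refers to one abstract object; equal objects may alias. */
    static boolean mayAlias(AbstractObject x, AbstractObject y) {
        return x.equals(y);
    }

    public static void main(String[] args) {
        List<Integer> viaLine3 = List.of(3);  // a = AInt.getInstance(5)
        List<Integer> viaLine4 = List.of(4);  // b = AInt.getInstance(7)
        for (int k = 0; k <= 1; k++) {
            AbstractObject a = abstractAllocation("a@8", viaLine3, k);
            AbstractObject b = abstractAllocation("a@8", viaLine4, k);
            System.out.println("k=" + k + ": a -> " + a + ", b -> " + b
                    + ", mayAlias=" + mayAlias(a, b));
        }
    }
}
```

Truncating the call stack to a fixed k is what keeps the number of abstract objects finite; a real analysis would track sets of abstract objects per variable rather than a single one.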

The discussion above demonstrates how the choice of heap abstraction can
(potentially) influence pointer analysis. Therefore, in this work, we study the
frameworks’ heap abstractions. To gain initial insights and to validate our
intuition, we conducted a preliminary study in which we analyzed Listing 1.1
context-sensitively with a one-call-site context-sensitivity in Doop with Wala’s
IR, and with the one-call-site-sensitive analysis of the Wala framework. Both of
these analyses use a context-sensitive heap abstraction with a context length of
one. In spite of that, Wala creates 17 objects while Doop creates 133 objects
(~7x). The average points-to set size varies between 1.55 for the analysis
provided by Wala and 1.62 for Doop with Wala’s IR⁴. Thus, we can see that even
with the same level of sensitivity in heap abstraction (and IR), analysis results
depend on the framework used. Manual inspection revealed that Wala selectively
uses the context-sensitive heap abstraction, applying contextual heap abstraction
only to non-library classes while treating the library’s objects
context-insensitively. Out of the 17 heap objects, Wala uses context-sensitivity
for only 6 objects. In contrast, Doop leverages context-sensitivity for all heap
objects, including the library’s objects. These initial insights motivated us to
analyze the influence of heap abstraction on precision and scalability in more
detail in Section 4.4.

To summarize, the parameters for program analysis such as IR (Section 2.1),
static modeling of libraries (Section 2.2), and heap abstraction (Section 2.3)
affect the precision and scalability of a pointer analysis. Based on these initial
insights, we analyze the influence of the mentioned parameters using different
frameworks and frontends on a larger and more diverse set of benchmark applications.


3 Methodology 


3.1 Metrics Used 


The precision of a pointer analysis has been defined in numerous ways in the
literature. Some of the metrics for precision available in the literature are the
average size of the points-to sets, the number of call-graph edges, and the number
of resolved virtual calls. These metrics are not clearly superior to one another
but rather tailored to specific clients; for example, the last one is leveraged by
compilers in the devirtualization of virtual method calls.

All of these metrics reflect how precisely the analysis computes the points-to
sets (sets of heap objects referred to by a variable). For example, whether or not a
virtual call can be resolved depends on the heap objects’ types in the points-to
set of the target variable. If there is only one type (or subtypes thereof that do
not redefine the virtual method), then the virtual call is resolvable. Therefore,
the precision of a client analysis depends on how precisely the points-to set for
each variable in the program can be resolved, in other words, how low the value
of the average points-to set size is. An average size close to one is considered the
hallmark of pointer analysis [27].
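The resolvability condition described above can be sketched as follows. This is a minimal illustration rather than the check implemented by any of the studied frameworks: the class VirtualCallResolution, the dispatchTarget map (receiver type to the method implementation it dispatches to), and the Circle/Square hierarchy are all hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative check of whether a virtual call has exactly one possible target. */
final class VirtualCallResolution {

    /**
     * Returns true if every receiver type in the points-to set dispatches the call
     * to the same method implementation.
     */
    static boolean isResolvable(Set<String> receiverTypes, Map<String, String> dispatchTarget) {
        Set<String> targets = new HashSet<>();
        for (String type : receiverTypes) {
            String target = dispatchTarget.get(type);
            if (target == null) {
                return false;  // unknown type: conservatively treat the call as unresolved
            }
            targets.add(target);
        }
        return targets.size() == 1;
    }

    public static void main(String[] args) {
        // Hypothetical hierarchy: FilledCircle extends Circle without overriding draw().
        Map<String, String> drawTargets = Map.of(
                "Circle", "Circle.draw()",
                "FilledCircle", "Circle.draw()",
                "Square", "Square.draw()");
        System.out.println(isResolvable(Set.of("Circle", "FilledCircle"), drawTargets));
        System.out.println(isResolvable(Set.of("Circle", "Square"), drawTargets));
    }
}
```

A smaller points-to set for the receiver makes it more likely that only one target remains, which is why this client metric tracks the precision of the underlying points-to sets.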

Therefore, we leverage the widespread metric of average points-to set size for
our evaluation, i.e., the ratio of the total size of the points-to sets to the total
number of local variables [26,34]. It permits a client-agnostic comparison of the
pointer analyses, which generalizes our evaluation results to any specific analysis.
We refer to the average points-to set size as precision in this paper. Note that the
actual precision of the analysis is inversely related to the average points-to
set size: a lower precision value (i.e., average points-to set size) implies a higher
precision of the computed analysis result, as precise analyses aim at excluding
unrealizable (at runtime) allocation sites from the points-to sets of variables.

4 Note that due to context-sensitive analysis, the average points-to set size is better
than that mentioned in Sections 2.2 and 2.1.
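The average points-to set size used as the precision metric can be computed as in the sketch below. The class name AveragePointsToSetSize and the toy variables and heap objects are assumptions made for illustration; they are not part of the study's artifacts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative computation of the average points-to set size. */
final class AveragePointsToSetSize {

    /** Total size of all points-to sets divided by the number of variables. */
    static double average(Map<String, Set<String>> pointsTo) {
        if (pointsTo.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (Set<String> heapObjects : pointsTo.values()) {
            total += heapObjects.size();
        }
        return (double) total / pointsTo.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pointsTo = new LinkedHashMap<>();
        pointsTo.put("x", Set.of("o1", "o2", "o3"));
        pointsTo.put("y", Set.of("o1"));
        System.out.println(average(pointsTo));  // (3 + 1) / 2

        // A synthetic variable with an empty points-to set lowers the average,
        // although the analysis did not become more precise.
        pointsTo.put("tmp", Set.<String>of());
        System.out.println(average(pointsTo));  // (3 + 1 + 0) / 3
    }
}
```

The second call shows why the metric is sensitive to the number of variables an IR or class hierarchy introduces: the heap objects are unchanged, yet the value drops.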

An IR may create many synthetic variables, among other reasons for method
parameters or for φ-nodes at control-flow joins of SSA form. For example,
three-address code re-uses the same variable in assignments in the if and else blocks
of a conditional. However, SSA-based IRs insert a synthetic variable in a φ-node
at the control-flow join to select one of the distinct variables of the respective
blocks. The presence of synthetic variables in IRs impedes the comparison of
different analyses using the average points-to set size, as averages depend on
the (unequal) number of variables. Therefore, we devise heuristics to establish
comparability of our metrics for different IRs.

Another challenge in this work is inferring the impact of each analysis param- 
eter on its precision. Computed at the end of the analysis, the average points-to 
set size loses information on the contribution of a particular aspect of pointer 
analysis. Therefore, we require a fine-grained metric to quantify the precision 
for each parameter. We propose two such techniques, one for the class hierarchy 
and the other for the intermediate representation. 


Class Hierarchy The analysis of the program’s class hierarchy builds the
foundation for inferring relevant variables and heap allocations. However, each
framework leverages a particular strategy to infer classes that contribute to the
program’s semantics. Adding irrelevant classes to the class hierarchy may manifest
as a synthetically precise analysis, as these classes add to the total number of
variables (which will all point to an empty set), thus potentially decreasing
the average size of points-to sets. Some of these variables and heap allocations
are not part of the actual code executed at runtime, but rather arise out of an
imperfect model of the program analysis framework’s frontend. Here, we study
the variables and heap objects stemming from the additional classes exclusive
to a framework.

We first instrument the Doop framework to log the class hierarchies and
compare the class hierarchies obtained using Soot and Wala as frontends, which
yields the classes exclusive to each of the frameworks. $CH_{soot}$ and $CH_{wala}$
denote the sets of classes in the class hierarchies of Soot and Wala, respectively.
$CH_{common} = CH_{soot} \cap CH_{wala}$ is the set of classes common to both frameworks.
We define CH-precision in terms of the average points-to set size restricted to
variables defined in methods of $CH_{common}$.


Definition 1. CH-Precision (CP). Let $V_f^c$ be the set of variables defined in
methods of $CH_{common}$ for the frontend $f \in \{soot, wala\}$, and
$H_f^c(v) = \{h \mid h \in \text{points-to}(v), v \in V_f^c\}$. CH-Precision is the
ratio of $H_f^c$ and $V_f^c$, i.e.,

$$CP_f = \frac{\sum_{v \in V_f^c} |H_f^c(v)|}{|V_f^c|}$$

If an analysis does not contain any exclusive classes, or all of their variables
(and corresponding heap objects) belong to the types present in the set of
exclusive classes, CH-precision equals the average points-to set size.
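Definition 1 can be read as an average that simply skips variables declared in exclusive classes. The following sketch illustrates this reading; the class ChPrecisionSketch, the declaringClass map (variable to declaring class), and the toy data are hypothetical and not part of Doop's output format.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative computation of CH-precision over variables of common classes only. */
final class ChPrecisionSketch {

    /**
     * Average points-to set size restricted to variables whose declaring class
     * is contained in commonClasses (the intersection of both class hierarchies).
     */
    static double chPrecision(Map<String, Set<String>> pointsTo,
                              Map<String, String> declaringClass,
                              Set<String> commonClasses) {
        long variables = 0;
        long heapObjects = 0;
        for (Map.Entry<String, Set<String>> entry : pointsTo.entrySet()) {
            String owner = declaringClass.get(entry.getKey());
            if (owner != null && commonClasses.contains(owner)) {
                variables++;
                heapObjects += entry.getValue().size();
            }
        }
        return variables == 0 ? 0.0 : (double) heapObjects / variables;
    }

    public static void main(String[] args) {
        // Hypothetical data: class A is common, class C is exclusive to one frontend.
        Map<String, Set<String>> pointsTo = Map.of(
                "A.m/x", Set.of("o1", "o2"),
                "C.n/y", Set.of("o3", "o4", "o5", "o6"));
        Map<String, String> declaringClass = Map.of("A.m/x", "A", "C.n/y", "C");
        System.out.println(chPrecision(pointsTo, declaringClass, Set.of("A")));
    }
}
```

The same arithmetic can be checked against Table 2: for avrora under the one-call-site analysis, removing the exclusive variables and heap objects gives (883,798 - 297) / (96,680 - 19) ≈ 9.140, which is the reported CP value.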


Intermediate Representation (IR) The choice of IR determines a program’s
representation but retains the program’s semantics, particularly with respect to
heap allocations. Thus, different IRs can differ in the number of variables but
will not introduce additional heap objects (e.g., Listing 1.4). A fundamental
difference between Soot’s Jimple and Wala’s SSA-based IR is that SSA creates unique
variables for each variable definition, while three-address code does not.
Rendering our precision metric comparable for structurally different IRs is
challenging, as tracking which variables correspond to each other is technically
involved and may not be unique. Therefore, we rely on a heuristic to determine
comparable variables. We motivate the heuristic by considering the two different
IRs for the main method in Listing 1.1. Jimple (Listing 1.2) defines four variables,
r0 to r2 and @parameter0, while Wala’s IR (Listing 1.3) defines three variables:
v1 (implicit, not shown in the listing), v5, and v8.


Definition 2. Defm denotes the set of variables defined in a method:
$\mathit{Defm}(m, ir) = \bigcup_{s_i \in S_{m,ir}} \mathit{def}(s_i)$, where $S_{m,ir}$
is the set of statements in method $m$ for $ir$, and $\mathit{def}(s_i)$ is the set of
variables defined in $s_i$.


Definition 3. Interesting Method. A method $m$ is interesting if
$|\mathit{Defm}(m, wala)| \neq |\mathit{Defm}(m, jimple)|$ and $m$ is defined in a class
$C \in CH_{common}$; i.e., the number of variables defined in the method with the
same signature varies across IRs. M denotes the set of interesting methods.


To determine the set of interesting methods (M), we leverage the logs from the
pointer analyses and segregate the variables in the logs according to the declaring
method (m). If the sizes of the corresponding sets differ for a method m, it is
considered interesting. (M is confined to the set of methods defined in
$CH_{common}$ to exclude the exclusive classes.) Subsequently, we determine the
points-to relation for the variables in M.
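The selection of interesting methods described above amounts to comparing, per method signature, the sizes of the variable sets logged for the two IRs. The sketch below assumes the logs have already been grouped by declaring method; the class InterestingMethods, the map-based inputs, and the variable names given for getInstance are hypothetical, while the variable sets for main follow Listings 1.2 and 1.3.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative selection of interesting methods (Definition 3). */
final class InterestingMethods {

    /**
     * A method is interesting if it is declared in a common class and the number
     * of variables defined for it differs between the Wala and the Jimple IR.
     */
    static Set<String> find(Map<String, Set<String>> defsWala,
                            Map<String, Set<String>> defsJimple,
                            Map<String, String> declaringClass,
                            Set<String> commonClasses) {
        Set<String> interesting = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : defsWala.entrySet()) {
            String method = entry.getKey();
            Set<String> jimpleDefs = defsJimple.get(method);
            String owner = declaringClass.get(method);
            if (jimpleDefs == null || owner == null || !commonClasses.contains(owner)) {
                continue;  // not present in both IRs, or declared in an exclusive class
            }
            if (entry.getValue().size() != jimpleDefs.size()) {
                interesting.add(method);
            }
        }
        return interesting;
    }

    public static void main(String[] args) {
        String main = "<Factory: void main(java.lang.String[])>";
        String getInstance = "<AInt: AInt getInstance(int)>";
        // main: variable sets as in Listings 1.2 and 1.3; getInstance: made-up names.
        Map<String, Set<String>> wala = Map.of(
                main, Set.of("v1", "v5", "v8"),
                getInstance, Set.of("v1", "v3"));
        Map<String, Set<String>> jimple = Map.of(
                main, Set.of("@parameter0", "r0", "r1", "r2"),
                getInstance, Set.of("i0", "$r0"));
        Map<String, String> owner = Map.of(main, "Factory", getInstance, "AInt");
        System.out.println(find(wala, jimple, owner, Set.of("Factory", "AInt")));
    }
}
```

For the main method of Listing 1.1, the counts are three (Wala) versus four (Jimple), so the method is reported as interesting.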

A simple average over heap objects and variables is insufficient
for comparing the precision of the analysis between two IRs. Differences in class
hierarchies and aliasing generate new variables, which makes the ratio
incomparable if the heap objects are not the same. To circumvent this problem, we
combine the average points-to set size with ideas from virtual call resolution. The
number of virtual call sites in a program is identical irrespective of the
differences in program representation (caused by aliasing and redundant variables).
Therefore, we obtain a fair comparison if we restrict the average points-to set size
to the target variables of virtual method calls. We define a new metric, average
devirtualized heap objects ($H_f$), which is the ratio of the total size of the
points-to sets of target variables at the virtual call sites to the number of
virtual call sites.


Definition 4. Average devirtualized heap objects ($H_f$). For the set of virtual
call sites $C$ in the IR of a framework $f$, with $V_{C,f}$ as the set of invoking
variables at $C$, let $H_v = \text{points-to}(v)$ be the set of heap objects referred
to by $v \in V_{C,f}$. Average devirtualized heap objects is

$$H_f = \frac{\sum_{v \in V_{C,f}} |\text{points-to}(v)|}{|C|}$$
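Definition 4 can be sketched as follows. The class DevirtualizedHeapObjects and the representation of call sites as a map from a call-site label to its invoking (receiver) variable are assumptions for illustration; following the definition literally, the sum ranges over the set of distinct invoking variables and is divided by the number of virtual call sites.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative computation of the average devirtualized heap objects (Definition 4). */
final class DevirtualizedHeapObjects {

    /**
     * Sums the points-to set sizes of the distinct invoking variables and divides
     * by the number of virtual call sites.
     */
    static double average(Map<String, String> receiverOfCallSite,
                          Map<String, Set<String>> pointsTo) {
        if (receiverOfCallSite.isEmpty()) {
            return 0.0;
        }
        Set<String> receivers = new HashSet<>(receiverOfCallSite.values());
        long total = 0;
        for (String receiver : receivers) {
            Set<String> heapObjects = pointsTo.get(receiver);
            if (heapObjects != null) {
                total += heapObjects.size();  // variables without an entry count as empty
            }
        }
        return (double) total / receiverOfCallSite.size();
    }

    public static void main(String[] args) {
        // Hypothetical call sites c1 and c2 with receivers x and y.
        Map<String, String> receivers = Map.of("c1", "x", "c2", "y");
        Map<String, Set<String>> pointsTo = Map.of(
                "x", Set.of("o1"),
                "y", Set.of("o2", "o3", "o4"));
        System.out.println(average(receivers, pointsTo));  // (1 + 3) / 2
    }
}
```

As a sanity check against Table 5, Avrora under the one-call-site analysis with Soot has 7,684 heap objects for 3,499 virtual calls, i.e., 7,684 / 3,499 ≈ 2.20. Values below one can occur, for instance, when some invoking variables have empty points-to sets.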


Based on the above discussion, we formulate and answer the following
research questions:
RQ1. How does the class hierarchy vary with the benchmarks?
RQ2. How do differences in class hierarchies affect the precision of analyses?
RQ3. How does the choice of IR affect the precision of the analysis?
RQ4. How do the heap abstractions differ between pointer analysis frameworks?


4 Evaluation 


We use Doop version 4.20.7-67 and Wala version 1.5.0. For RQ1-RQ3, we
invoked Doop with the following analysis options: 1-call-site-sensitive,
1-object-sensitive, 2-call-site-sensitive+heap, and 2-object-sensitive+heap.
The specific options used for each research question are described in the
respective sections. We use the DaCapo [2] (version 9.12-bach)
benchmarks, a standardized suite of open-source Java applications, for our study.


4.1 RQ1: Class hierarchy differences with benchmarks


We captured the class hierarchies considered by the analyses to determine the
differences. We instrumented Doop to log the classes considered during a
(context-insensitive) analysis, which yields the complete class hierarchy. In order to
investigate whether the class hierarchy depends on the frontend, we performed
this experiment with Soot and Wala as frontends⁵. Table 1 lists the differences
in the class hierarchies using Soot and Wala. On average, Wala exclusively
contains ~13,994 classes in its class hierarchy. The number of classes exclusive to
Wala ranges from 12,524 (Xalan) to 16,707 (Tradebeans). Soot’s class hierarchy
on average contains 26 classes not present in Wala’s, ranging from zero to 62.
In the case of PMD and H2, Soot’s class hierarchy contains only a single
additional class; Jython has 2 additional classes. Eclipse, Lusearch, and Luindex
contain 62, 53, and 53 additional classes, respectively. In the remaining cases, the
class hierarchy in Soot is strictly a subset of Wala’s. In the next RQ, we study the
impact of these additional classes on the precision and scalability of the analysis.
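The comparison of the logged class hierarchies boils down to set differences and an intersection. The sketch below illustrates this under the assumption that each log has been read into a set of fully qualified class names; the class ClassHierarchyDiff and the tiny example sets are hypothetical (the package names are those mentioned in Section 2.2, the concrete selection is made up).

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative comparison of two logged class hierarchies. */
final class ClassHierarchyDiff {

    /** Classes present in first but not in second. */
    static Set<String> exclusiveTo(Set<String> first, Set<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.removeAll(second);
        return result;
    }

    /** Classes present in both hierarchies. */
    static Set<String> common(Set<String> first, Set<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.retainAll(second);
        return result;
    }

    public static void main(String[] args) {
        // Made-up logs for illustration only.
        Set<String> chSoot = Set.of("Factory", "AInt", "javax.crypto.Cipher");
        Set<String> chWala = Set.of("Factory", "AInt",
                "java.security.PrivilegedActionException");
        System.out.println("common:       " + common(chSoot, chWala));
        System.out.println("only in Soot: " + exclusiveTo(chSoot, chWala));
        System.out.println("only in Wala: " + exclusiveTo(chWala, chSoot));
    }
}
```

The counts in Table 1 are consistent with this view: for Avrora, 21,997 - 12,793 = 9,204 classes remain in Wala's hierarchy after removing its exclusive classes, which equals Soot's 9,204 classes minus its 0 exclusive ones.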


4.2 RQ2: Precision differences with class hierarchy 


5 Note that Soot and Wala provide options to exclude certain classes from analysis
(to, e.g., exclude library classes). For a fair comparison we ignore this feature and
compute the whole class hierarchy including libraries.

Table 1: Difference in classes considered by Soot and Wala. Last two columns
show the extra classes loaded by Soot and Wala respectively.

              #classes analyzed    Extra classes
Benchmark     Wala      Soot       Soot    Wala
Avrora        21,997    9,204      0       12,793
Batik         23,461    10,739     12      12,734
Eclipse       25,718    9,813      62      15,967
H2            21,007    8,042      1       12,966
Jython        23,323    10,411     2       12,914
Lusearch      20,469    4,671      53      15,851
Luindex       20,479    4,681      53      15,851
PMD           21,315    8,517      1       12,799
SunFlow       20,677    7,847      0       12,830
Tradebeans    20,658    3,951      0       16,707
Xalan         22,688    10,164     0       12,524


Study Setup We have used the var-points-to relation, which maps all variable
and context pairs to their resolved pairs of heap object and context. We
select those variables that originate from classes common to both frameworks
(Section 4.1) and query their points-to information. We then compute the
CH-Precision based on Definition 1.


Results Table 2 presents the results of the analysis (for one-callsite, one-object,
and two-object context-sensitivity) for the objects and variables belonging to
exclusive classes present in Wala (only non-zero values included). Note that the
two-object-sensitive analysis did not terminate for Eclipse and Jython;
therefore, these are not presented in the table. For the one-callsite and
one-object analyses, Table 2 shows that six out of eleven benchmarks contain
variables that belong to the exclusive class hierarchy. The remaining benchmark
applications show no differences in the number of variables and heap objects,
despite the presence of additional classes. This demonstrates that the additional
classes loaded by these frameworks have no influence on the precision for these
benchmarks.

The third and fourth columns of Table 2 list the number of variables (in
principle, variable-context pairs) and heap objects belonging to the set of
exclusive classes, respectively. In all analyses, all but one benchmark have a higher
average points-to set size for exclusive variables than the general average.
Tradebeans only creates 3 additional heap objects with Wala’s frontend; therefore,
the analyses are almost identical for both frontends. The average points-to sets for
exclusive classes for bigger benchmarks such as Eclipse and Jython are outliers,
showing very high averages. Still, the contribution of exclusive classes’ heap
objects and variables is negligible compared with the total heap objects of these
benchmarks.

The eighth and ninth columns depict the original precision and the CH-precision
for the analyses. We observe that the CH-precision is slightly lower than
the precision for all benchmarks but Tradebeans, which originates from the
additional heap objects and variables. These primarily belong to internal libraries
such as sun.util and sun.util.resources (discussed later).

Table 2: Differences in precision in the presence of additional objects in class
hierarchy (Wala). HO denotes the sum of number of heap objects in points-to
set. $CP_{wala}$ is the precision score for variables in $CH_{common}$.

                        Exclusive classes            Original
Analysis  Benchmark     Vars.  HO       Average      Vars.      HO          Precision  CP_wala
1CS       avrora        19     297      15.63        96,680     883,798     9.141      9.140
          eclipse       453    171,071  377.64       1,231,854  61,556,548  49.970     49.850
          h2            31     321      10.35        78,154     639,202     8.178      8.177
          jython        35     17,682   505.2        289,244    8,000,917   27.661     27.603
          tradebeans    3      3        1            59,853     549,391     9.179      9.179
          xalan         39     2,466    63.23        147,488    1,911,750   12.962     12.948
1OS       avrora        19     14,844   781.26       82,972     404,231     4.871      4.694
          eclipse       388    329,008  847.95       1,053,618  46,337,474  43.979     43.683
          h2            31     2,747    88.61        59,800     220,058     3.679      3.635
          jython        35     147,214  4,206.11     573,823    22,152,008  38.604     38.35
          tradebeans    3      4        1.33         45,807     154,883     3.381      3.381
          xalan         39     13,831   354.64       199,404    1,576,762   7.907      7.839
2OS       avrora        19     1,752    92.21        119,805    348,368     2.907      2.893
          h2            31     1,195    38.54        82,795     242,667     2.930      2.917
          tradebeans    3      4        1.33         57,200     197,808     3.458      3.458
          xalan         55     4,268    77.6         362,885    1,733,576   4.777      4.766


Table 3: Differences in precision in the presence of additional objects in class
hierarchy for Eclipse (Soot).

                                 Variables  Heap Objects  CP_soot  Original
1-call-site  Exclusive Classes   786        3331          44.95    -
             Original            1.5M       68.5M         -        44.92
1-object     Exclusive Classes   1020       4130          44.90    -
             Original            1.3M       60.8M         -        44.87

With the Soot frontend (Table 3), the CH-Precision differs from the precision
only for the benchmark Eclipse; for the other benchmarks, the analysis does
not contain any objects whose type belongs to the exclusive classes of the
frontend. However, it is difficult to compare the precision of Soot versus Wala
based on the CH-Precision score due to the differing numbers of variables for the
same benchmark application.


Finding 1: Differences in class-hierarchy negligibly impact the pointer 
analysis precision (and thus client analyses). 


Soundness In our observation, the Wala frontend takes the internal Java libraries
into account. We find heap objects belonging to libraries such as sun.nio.fs,
sun.util.resources, sun.security, and sun.nio.cs, which are internal libraries used
by the JVM. Soot, on the other hand, does not model these libraries for analysis.

Comparing the class hierarchies of the analyses using Soot and Wala, we
observed that the class hierarchy using Soot as frontend is a subset of Wala’s for all
benchmarks except Eclipse. This suggests that analyses with Soot are as sound
as analyses with Wala for all benchmarks except Eclipse. Eclipse is a compelling
case: its analysis using Soot contains heap objects and variables that belong to
the internal libraries of Eclipse, such as
org.eclipse.core.internal.runtime.PerformanceStatsProcessor, while the analyses
with Wala do not report these objects. However, results from the analyses with
Wala contain heap objects from internal libraries such as sun.util.*, which are
not present using Soot. This shows that the class hierarchy model is unsound in
both frontends, as both lack some of the classes loaded by these benchmark
applications at runtime.


Our study reveals that library modeling in both Soot and Wala is unsound
even for (non-native) Java objects, shown by the presence of heap objects
belonging to the exclusive classes of Soot and Wala.


Table 4: Total (for each framework) and interesting (Section 4.3) methods M.

              1-CS                    1-OS                    2-OS
Benchmark     Soot   Wala   M         Soot   Wala   M         Soot  Wala  M
Avrora        3651   3678   3194      3642   3669   3187      3615  3642  3159
Batik         3407   3415   3006      3398   3406   2999      3285  3293  2895
Eclipse       20339  20281  18723     20261  20204  18655     Timed out
H2            3041   3091   2673      3027   3075   2661      2985  3029  2616
Jython        8482   8531   7672      8447   8494   7643      Timed out
Lusearch      2449   2457   2135      2440   2448   2128      2414  2422  2103
Luindex       3524   3532   3132      3514   3522   3124      3466  3474  3081
PMD           4587   4596   4131      4577   4586   4124      4418  4427  3978
Sunflow       8369   8384   7514      8335   8350   7475      7740  7754  6928
Tradebeans    2442   2406   2083      2433   2397   2076      2407  2371  2051
Xalan         4607   5701   4125      4597   5678   4115      4502  5503  4031


4.3 RQ3: Precision for IR varies with the framework 


Study Setup The study setup is similar to Section 4.2. We use the application’s 
var-points-to sets, i.e., the relation of variables and heap objects excluding the 
library objects. From the results of the three analysis sensitivities, we extract the 
set of interesting methods (M, Def. 3) and compute the average devirtualized 
heap objects score for the virtual calls in interesting methods. We use the Jimple 
IR (--no-ssa option in Doop), and Wala’s IR (--wala-fact-gen option in 
Doop) for evaluation. 


Results Table 4 reports the number of interesting methods and total methods
resolved using both frontends. Note that the number of interesting methods is
identical for both frameworks for the same type of context-sensitivity. The
number of reachable methods in each analysis differs, just as the number of distinct
method signatures discovered in each framework (columns Soot and Wala in 1-CS,
1-OS, and 2-OS⁶). However, deriving a relationship between those is impossible, as
analyses such as one-call-site and one-object are not comparable. In all cases, we
observed that the majority (~90%) of the methods are interesting. Therefore,
we cannot ignore the significance of this aspect.

6 We excluded 2-CS for its large file sizes.

Table 5: Results for IR. Third and fifth columns are the number of heap objects.
Fourth and sixth columns are the number of virtual calls. Last two columns list
the average devirtualized heap objects ($H_f$) for Soot and Wala respectively.

                          Soot                      Wala                      H_f
Analysis     Benchmark    Heap Objs.  Virt. Calls   Heap Objs.  Virt. Calls   Soot    Wala
1 call-site  Avrora       7,684       3,499         7,759       3,499         2.20    2.22
sensitive    Batik        2,645       1,588         2,702       1,588         1.67    1.70
             Eclipse      7.7M        56.8K         7.9M        56.8K         136.33  139.24
             H2           1,936       1,434         1,988       1,434         1.35    1.39
             Jython       662K        9,286         656K        9,283         71.33   70.67
             Lusearch     1,667       1,139         1,674       1,139         1.46    1.47
             Luindex      8,090       4,408         8,098       4,408         1.84    1.84
             PMD          8,518       3,527         8,708       3,527         2.42    2.47
             Sunflow      4,741       2,088         4,627       2,088         2.27    2.22
             Tradebeans   1,638       1,114         1,649       1,106         1.47    1.49
             Xalan        43K         5,832         55K         5,850         7.45    9.44
1 object     Avrora       6,561       3,498         6,563       3,498         1.88    1.88
sensitive    Batik        1,673       1,587         1,709       1,587         1.05    1.08
             Eclipse      2.9M        56.7K         3.0M        56.8K         51.61   53.53
             H2           1,218       1,433         1,258       1,433         0.85    0.88
             Jython       3.5K        9,272         3.6K        9,269         386.79  389.20
             Lusearch     958         1,138         964         1,138         0.84    0.85
             Luindex      4,530       4,407         4,552       4,407         1.03    1.03
             PMD          7,369       3,527         7,518       3,527         2.09    2.13
             Sunflow      2,978       2,088         2,864       2,088         1.43    1.37
             Tradebeans   928         1,113         938         1,105         0.83    0.85
             Xalan        99K         5,830         106K        5,810         17.11   18.33
2 object     Avrora       8,561       3,459         8,563       3,459         2.47    2.48
sensitive    Batik        1,257       1,567         1,275       1,567         0.80    0.81
             H2           1,288       1,433         1,307       1,433         0.90    0.91
             Luindex      5,210       4,363         5,215       4,363         1.19    1.20
             Lusearch     948         1,138         954         1,138         0.83    0.84
             PMD          7,271       3,496         7,398       3,496         2.08    2.12
             Sunflow      2,342       2,088         2,324       2,088         1.12    1.11
             Tradebeans   919         1,113         929         1,105         0.83    0.84
             Xalan        214K        5,791         215K        5,771         36.97   37.36


Interesting methods are difficult to ignore because of their sheer presence in
the benchmark applications.

Table 5 presents the differences in the average devirtualized heap objects
for the Jimple and Wala IRs. Although the number of variables and abstract heap
locations depend on the IR, we did not observe many differences between
those when restricting ourselves to target variables of virtual method calls, which
corresponds to our intuition. The differences in the $H_f$ values for both IRs
are negligible except for three larger benchmarks: Jython, Eclipse, and Xalan.
Overall, the values from the Soot IR were smaller than those of Wala, implying
that devirtualization in Soot is either slightly more precise or slightly less sound
than in Wala; however, the differences are minor in the majority of cases. In
conclusion, the choice of IR shows little to no impact on the precision of pointer
analysis. In the sequel, we describe one such case study where the difference in
$H_f$ is approximately two, which is significant compared to the others.


Table 6: Differences Soot IR v/s Wala IR for Xalan

Methods                                                           Wala  Soot  Actual
org.apache.xalan.transformer.TransformerImpl.transformNode()     ✓     ✗     ✓
Exceptions                                                        ✓     ✗     ✓
org.apache.xalan.xsltc.trax.TransformerFactoryImpl.setFeature()  ✗     ✓     ✓
MethodResolver.getConstructor()                                   ✓     ✗     ✓
xerces.xml.dtd.XMLDTDLoader()                                     ✓     ✗     -
org.apache.xpath.getSourceTree()                                  ✓     ✗     ✓


Listing 1.5: Differences in types of heap objects created in both analysis

1 (Wala) sun.misc.URLClassPath$Loader
2 (Wala) java.util.zip.ZipError
3 (Soot) javax.xml.transform.FactoryFinder$ConfigurationError


Finding 2: IR has negligible impact on the precision of pointer analysis at 
least for the devirtualization client. 


Case Study—Xalan To further investigate the differences, we chose Xalan using 
a one-call-site analysis as the Hf values for Soot (7.45) and Wala (9.44) display 
the highest difference among all benchmarks. The number of heap objects in 
both cases differs significantly, with Soot having 43K heap objects, and Wala 
having 55K heap objects for a comparable number of virtual calls (5,832 vs. 
5,850). 
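
The Hf figures in Table 5 are consistent with dividing the number of heap objects by the number of virtual calls. The following minimal sketch reproduces a few rows under that assumption; the definition of the metric lies outside this section, so the formula and the helper name `hf` are illustrative assumptions rather than the authors' implementation.

```python
# Assumption: Hf = heap objects / virtual calls (inferred from the Table 5
# values, not quoted from the paper's definition of the metric).
def hf(heap_objects: int, virtual_calls: int) -> float:
    """Average number of heap objects per virtual call."""
    return heap_objects / virtual_calls


# (benchmark, framework) -> (heap objects, virtual calls),
# taken from the one-call-site-sensitive rows of Table 5.
rows = {
    ("Avrora", "Soot"): (7684, 3499),
    ("Avrora", "Wala"): (7759, 3499),
    ("Batik", "Soot"): (2645, 1588),
    ("Batik", "Wala"): (2702, 1588),
}

for (benchmark, framework), (heap, calls) in rows.items():
    print(f"{benchmark:8s} {framework}: Hf = {hf(heap, calls):.2f}")
```

For rows that Table 5 reports in rounded form (e.g., 43K heap objects for Xalan), the recomputed ratio only approximates the published value.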

To examine the heap objects, we collected their class types. We observed that
the types of some of these objects belong to classes in $CH_{Soot} \setminus CH_{common}$ or
$CH_{Wala} \setminus CH_{common}$. Listing 1.5 depicts the differences in heap objects created
by these frameworks.

We also discovered (potential) sources of imprecision and unsoundness in
both analyses. Table 6 lists methods and exceptions missed by the Soot and
Wala frameworks. Note that these methods and exceptions belong to the common
class hierarchy. We observed that Wala models exceptions more precisely
than Soot. For other virtual method invocations, we compared the runtime
call-graph to the static call-graph. In our observation, both Wala and Soot
are unsound, as demonstrated by the absence of certain method calls in the
call-graph for both analyses. In addition, Wala imprecisely includes
xerces.xml.dtd.XMLDTDLoader() in its call-graph (which, at least in our
experiments, was not executed at runtime).


Apart from reflection, imprecise/unsound virtual call resolution also induces
imprecision/unsoundness into the analysis.
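
The comparison of a runtime call-graph with a static call-graph described above can be sketched with two set differences. This is a minimal illustration, not the authors' tooling: the shortened method names are hypothetical labels, and the membership of each method in the two sets follows our reading of Table 6 and the surrounding text for Wala.

```python
# Illustration only: set membership follows our reading of Table 6 for Wala;
# the shortened method names are hypothetical labels.
runtime_calls = {                      # calls observed while running the tests
    "TransformerFactoryImpl.setFeature",
    "MethodResolver.getConstructor",
}
wala_static_calls = {                  # calls present in the static call-graph
    "MethodResolver.getConstructor",
    "XMLDTDLoader",
}


def compare_call_graphs(static_calls, observed_calls):
    """Return (missed, unobserved) sets of methods."""
    # Executed at runtime but absent from the static graph: unsoundness.
    missed = observed_calls - static_calls
    # Reported statically but never executed: possible imprecision.
    unobserved = static_calls - observed_calls
    return missed, unobserved


missed, unobserved = compare_call_graphs(wala_static_calls, runtime_calls)
print("missed by the static analysis:", sorted(missed))
print("not observed at runtime:", sorted(unobserved))
```

A method in the second set is only a candidate for imprecision, since the test inputs may simply not exercise it.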


4.4 RQ4: Heap abstractions in pointer analysis frameworks 


In this section, we compare Doop’s analysis using Wala’s frontend with
Wala’s own analysis. We omit the comparison with the Soot framework
as it leverages IRs different from Wala’s and thus would not be
comparable.


Study Setup We compare the one-call-site-sensitive analysis with context-
sensitive heap abstraction (unique heap objects for each call-site, i.e., heap
cloning) available in the Wala framework with a one-call-site analysis with
one-level heap abstraction in Doop, and set the time budget to 7 hours.
Analyses with a higher level of call-site sensitivity were not scalable in the
Wala framework and therefore we do not use them. Other optimizations in
Wala, such as the use of object-sensitivity only for collection objects, are not
comparable to the object-sensitive analysis available in Doop. Therefore, we
also choose to ignore them. To handle reflective calls in Wala, we use the option
REFLECTIONS.FULL. In what follows, we present the results of our study. We
first present the differences in the number of heap objects and, subsequently,
delve into their implications.


Table 7: Number of Heap objects

Benchmark   | #Heap Objects    | Types
            | Doop     Wala    | Doop    Wala
avrora      | 2,504    28,235  | 751     3,256
batik       | 1,699    16,724  | 537     1,938
h2          | 1,467    16,688  | 482     1,934
lusearch    | 1,242    16,274  | 551     1,898
luindex     | 1,901    19,343  | 404     2,250
pmd         | 2,398    31,774  | 734     2,498
sunflow     | 4,424    16,688  | 1,196   1,934
tradebeans  | 1,230    16,734  | 405     1,937
xalan       | 3,874    18,174  | 1,003   2,078

Differences in the heap objects For evaluation, we extracted the heap objects
created in Wala’s and Doop’s analyses and observed huge differences in their
number. Intuitively, using the same level of heap-sensitivity (heap-cloning)
should create the same number of heap objects. However, in certain cases, the
number of heap objects in Wala is larger than that in Doop by a factor of ~14
(columns 2 and 3 in Table 7). (Note that eclipse and jython are elided,
as the analyses did not terminate within the time budget owing to the large
file size (~100GB).) Therefore, the heap abstractions of these analyses are not
comparable, although superficially they look similar.


Subtle optimizations also manifest themselves as imprecise heap modeling,
even though the heap abstractions look similar at the outset.
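
To make the size of the gap concrete, the ratios between the Wala and Doop columns of Table 7 can be recomputed directly. The snippet below is a small worked example over the published numbers; the dictionary layout and variable names are our own choices.

```python
# Table 7: benchmark -> (Doop heap objects, Wala heap objects,
#                        Doop types, Wala types)
table7 = {
    "avrora": (2504, 28235, 751, 3256),
    "batik": (1699, 16724, 537, 1938),
    "h2": (1467, 16688, 482, 1934),
    "lusearch": (1242, 16274, 551, 1898),
    "luindex": (1901, 19343, 404, 2250),
    "pmd": (2398, 31774, 734, 2498),
    "sunflow": (4424, 16688, 1196, 1934),
    "tradebeans": (1230, 16734, 405, 1937),
    "xalan": (3874, 18174, 1003, 2078),
}

for name, (doop_heap, wala_heap, doop_types, wala_types) in table7.items():
    heap_ratio = wala_heap / doop_heap
    type_ratio = wala_types / doop_types
    print(f"{name:10s} heap objects x{heap_ratio:5.2f}   types x{type_ratio:4.2f}")

# Benchmark with the largest Wala/Doop heap-object ratio.
largest = max(table7, key=lambda b: table7[b][1] / table7[b][0])
print("largest heap-object ratio:", largest)
```

The largest heap-object ratio is roughly 13.6 (tradebeans), in line with the factor of ~14 mentioned in the text.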


To investigate this further, we compared the types of the heap objects.
Our study shows that the sets of types are not even consistent when using the
same frontend! In many cases, the number of object types analyzed by Wala is
approximately four times that in Doop (columns 4 and 5 in Table 7). The
differences in heap abstraction for application-level objects are the reason for this.

Application level objects Application-level objects are the heap objects
created due to allocations within the program (rather than libraries). In three out
of eleven benchmarks, we observe that Doop’s analysis lacks application-level
classes that Wala reports. We found the corresponding allocations on manual
inspection of the source code. For example, in avrora, the analysis in Wala
allocates heap objects of BRNE_builder [8], which are not present in Doop’s.
Similar cases can be found in PMD and Xalan. However, owing to the limitations
of the program representation, we could not determine the precise reason
for the unsoundness. Pointer analysis uses an IR based on a control flow graph
(CFG) rather than source code. Being a lower-level representation of the program
source code, the IR mangles variable names. Therefore, a one-to-one correspondence
between the IR’s variables and the variables in the source code is not trivial.


Finding 3: Heap modeling is not similar even for allocations within the
application scope. Wala handles application-level objects more precisely than
Soot in our evaluation.


5 Threats to Validity 


Naturally, the technique used relies on the precise handling of reflection calls and
other dynamic features of the language, such as dynamic proxies. Beyond
that, handling native calls could alleviate the unsoundness of the analyses:
analyzing native calls could infer the native objects in the JVM that are missed
by the Soot framework. Here, we have used the TamiFlex framework for handling
reflection calls. Other approaches have improved the reflection handling
[10, 15-18, 25]. To convince ourselves, we experimented with one of the
state-of-the-art techniques, i.e., reflection with matching substring resolution [10].
However, we did not find any significant differences in results. Another limitation
of this study is the unsoundness from ignoring the native library calls in static
analyses. A few of the sources of unsoundness discovered stem from native calls.
Recently, Fourtounis et al. [7] proposed a technique for resolving native calls in
Java. However, at the time of writing this paper, the technique was not available.
Further, our analysis in Section 4.3 is based on test cases, which may not reflect
all possible executions of an application.

Our study also involves hours of manual evaluation, which can be subject to
bias. To counteract it, we manually inspected the source code, especially
for the sources of unsoundness. We also reran the benchmark applications with
valid inputs to confirm that the objects are actually allocated at runtime.


6 Related Work 


Pointer analysis tools Pointer analysis has garnered significant interest in the
last decades, focussing on scalability, precision, and soundness. The Doop system
used in this paper results from years of research on declarative-style pointer
analysis [1, 3, 10, 24, 26]. Similarly, the Wala framework was the result of an
industrial project and, unlike Doop, follows an imperative paradigm. The underlying
program representation comes with many prior assumptions. In this
work, we study the effects of these assumptions on program analysis.


Empirical studies on pointer analysis Recent empirical studies focussed on the
soundness limitations arising from dynamic language features in existing pointer
analyses and call-graph construction, as the two are closely related and mutually
dependent static analyses. Dietrich et al. [5] proposed automated and manual
techniques to generate unsoundness oracles to test static analysis. Sui et al. [32]
present the causes of unsoundness in static analysis frameworks (Soot, Wala, and
Doop) due to the dynamic features of languages. Reif et al. [21] conducted a
comprehensive study of call-graph generation algorithms, focussed on features in
Java 9, and exposed problems in the state of the art, especially those related to
method calls in the Java runtime. Our work is orthogonal: we evaluate the
influence of program representation on program analyses. Here, we focus on the
program representation in static analysis frameworks and on the unsoundness
arising from it. Our study can also be extended to Java 9.

Sui et al. [33] evaluated the recall of call-graph construction and present
how it impacts the algorithms in practice. Their evaluation exposes problems
in the state of the art, especially those related to method calls in the Java
runtime. Our unsoundness results concur with theirs. Here, we have focussed on
program representation rather than on the dynamic features of the language,
which are hard for static analyzers to handle. Further, our work features two
novel metrics, apart from the standard precision and recall, to measure the impact
of different aspects of program representation.


7 Conclusion 


This paper reports the effects of program representation on program analysis.
Our metrics make it possible to compare implementations leveraging different
frontends. We find that differences in program representation have negligible
impact on the precision of the pointer analysis. In addition, we discovered novel
sources of unsoundness and imprecision in the program analysis. Our results also
demonstrate that the promised heap abstractions are not similar in practice, even
though they may appear so from a bird’s-eye view. Since pointer analysis builds the
foundation of many static analyses, we conjecture that the results generalize to
these as well.


Abstract. Structural runtime models provide a snapshot of the constituents
of a system and their state. Capturing the history of runtime
models, i.e., previous snapshots, has been shown to be useful for a number
of aims. However, handling history at runtime poses important challenges
to tool support. We present the INTEMPO tool, which is based
on the ECLIPSE Modeling Framework and encodes runtime models as
graphs. Key features of INTEMPO, such as the integration of temporal
requirements into graph queries, the in-memory storage of the model,
and a systematic method to contain the model’s memory consumption,
are intended to address issues which seemingly place limitations on the
available tool support. INTEMPO offers two operation modes which support
both runtime and postmortem application scenarios.


Keywords: runtime models - time-awareness - temporal graph queries 


1 Introduction to InTempo 


A (structural) Runtime Model (RTM) provides a snapshot of the constituents of 
a system and their state [3]. RTMs are typically employed in the context of Self- 
adaptive Systems (SAS) [4], where a feedback loop adapts the system behavior 
at runtime in response to external or internal stimuli, the latter represented as 
model fragments in the RTM and detected via the execution of model queries. 

Encoding an RTM as a graph enables detection via graph queries, which 
specify a sought (graph) pattern. Such an encoding conforms to a metamodel 
which restricts the structure of model instances and defines types of vertices, 
edges, and attributes. Formally, these concepts rely on typed, attributed graph 
transformation [6] where graphs are typed over a type graph. 

Capturing the history of RTMs, i.e., previous snapshots, may be useful for 
a number of aims such as the detection of recurrent behavior or postmortem 
analysis [3,8]. However, handling history at runtime poses important challenges 
to tool support. Tools are required to enable the specification and timely execu- 
tion of queries with temporal requirements, i.e., requirements on the evolution of 
patterns over multiple snapshots. Timely execution is crucial for SAS, where a 
loop may depend on query results before planning and performing adaptations. 

Faced with these challenges, the available tool support is seemingly limited
either by the lack of support for direct specification of temporal requirements
in graph queries [5] or by the on-disk representation of the model [8,11] that
introduces an overhead on execution times in runtime settings, e.g., in SAS.

We present the INTEMPO (INcremental queries with TEMPOral requirements)
tool (available online at [13]) which is based on the eponymous querying scheme
in [15] and aims at mitigating these limitations. INTEMPO introduces ITQL, a
language for the specification of temporal graph queries, which allow for the
expression of temporal requirements. The core functionality of INTEMPO executes
a query over an in-memory RTM which captures information about previous
snapshots, called Runtime Model with History (RTM^H), and returns the pattern
occurrences in the RTM^H that satisfy the specified temporal requirements.
INTEMPO is implemented in the ECLIPSE Modeling Framework (EMF) [7] and can
be used either via the ECLIPSE user interface or via an API. The latter enables
INTEMPO results to be utilized by other tools, e.g., a SAS feedback loop.

INTEMPO offers two operation modes intended for different application
scenarios (see Figure 1 for an illustration). The RTM^H Analysis (Section 2)
constitutes the core functionality of INTEMPO and executes a user-specified ITQL
query (in a file with the .itql extension; required extensions are in parentheses in
Figure 1) over a user-provided RTM^H, i.e., a persisted instance of an EMF
model (in the standard xmi format). This mode returns the query results for the
given RTM^H. Query results are kept in-between analyses and are updated by
each RTM^H Analysis, which is also known as incremental (query) execution. The
RTM^H Analysis is intended to be used in settings where query results can be
further utilized at runtime. For instance, a SAS feedback loop may use INTEMPO
to detect problems formulated as patterns, similarly to [9]. Subsequently, the
query results may be utilized to plan adaptations which address these problems.

The LogAnalysis operation mode (Section 3) assumes that, instead of being
captured by an RTM^H, past and present data about the system have been
captured in an event log. INTEMPO introduces E2P, a specification language that
allows for the mapping of event types to corresponding modifications of model
fragments, i.e., nodes, edges, and attributes. As input, LogAnalysis requires the
ITQL query, the log (with comma-separated values), and the E2P mapping. It
then processes the log and maintains an internal RTM^H which it uses to
perform RTM^H Analysis upon every event. LogAnalysis is intended for postmortem
scenarios. Thus, it returns the results that were valid after each RTM^H Analysis
sorted by the log timestamps, which affords a global, yet detailed, view on the
evolution of the system state.

INTEMPO is capable of containing the data accumulation in the RTM by 
systematically discovering and discarding data that is obsolete with respect to a 
given timestamp, i.e., not relevant to future query executions—this capability is 
presented in detail in [15]. Note that an implicit requirement of both operation 
modes is that the metamodel of the analyzed system has been encoded as an 
EMF Ecore model and is available in ECLIPSE (gray input in Figure 1). 


[Figure 1, two panels. RTM^H Analysis: inputs Query (itql) and RTM^H (xmi); scenario: SAS Feedback Loop. LogAnalysis: inputs Query (itql), Log, and Mapping (e2p); internal RTM^H Analysis; scenario: Postmortem analysis.]


Fig. 1: INTEMPO Execution Modes and Exemplary Application Scenarios

Exemplary Application To demonstrate the features and operation of INTEMPO we
rely on an example drawn from the case study conducted in [15]. Based on
real-world smart medical environments, the case study envisions a Smart
Healthcare System (SHS) where certain medical procedures are automated and
performed by devices, such as a smart pump administering medicine or a sensor
tracking patient data and diagnoses, tasks that a clinician would otherwise
perform. Data collected from the SHS are aggregated and recorded in medical
(event) logs.

[Figure 2. Node types: SHS (cts: long, dts: long), Sensor (cts: long, dts: long, id: string, status: string), Pump (cts: long, dts: long, id: string, status: string).]

Fig. 2: SHS Metamodel

INTEMPO requires a metamodel which has been instrumented such that all
nodes have at least two attributes named cts and dts, which capture the time
point of creation, respectively deletion, of the node in the system. As an example,
see the metamodel of our SHS in Figure 2. Note that to encode cts and dts
for edges in EMF, the respective edges would have to be modeled as nodes.
Technically, an RTM^H is an instance of such a metamodel. See G_3 in Figure 3
for an example based on the SHS metamodel: The RTM^H reflects that a node
of type Sensor that is attached to the patient with id=1 has been activated and
thus has been added to the SHS at timestamp 3. The sensor status reflects that
the patient has been diagnosed with sepsis. The value ∞ reflects that a dts for
this node has not been set, i.e., the node is still present in the modeled system.


2 RTM^H Analysis


This section presents an exemplary query in ITQL, which is then used to
demonstrate the RTM^H Analysis. It concludes with technical details.

InTempo Query Language (ITQL) Formally, a temporal graph query $q$ is
characterized by a (graph) pattern $p$ and an application condition $ac$, denoted
$q = (p, ac)$. A match $m$ corresponds to an occurrence of $p$ in the RTM^H. In order
for $m$ to be valid, it must satisfy the $ac$. ITQL supports the formulation of $ac$ in
the Metric Temporal Graph Logic (MTGL) [10], which supports operators such
as negation ($\neg$), existential quantification ($\exists$), conjunction ($\wedge$), and the metric,
i.e., interval-based, temporal operators until ($U_I$, where $I$ is a time interval over
$\mathbb{R}_0^+$) and since ($S_I$), as well as abbreviations such as eventually, i.e.,
$\Diamond_I \exists n = \mathit{true}\ U_I\ \exists n$, where $n$ is a graph pattern and $\mathit{true}$ is always satisfied. MTGL also
supports the nesting of patterns to bind graph elements in outer conditions and
relate them to inner (nested) conditions, i.e., elements common to two patterns
$n_1$ and $n_2$ refer to the same element in the RTM^H.

MTGL is able to express real-time properties such as “every patient diagnosed
with sepsis, must eventually within 5 time units be given the proper drug”
(adjusted from the medical guideline in [14]). In an RTM^H of the SHS,
INTEMPO can find violations of the property above by executing the ITQL query
$q_1 = (n_1, \kappa)$, with $\kappa$ the MTGL formula $\neg(\Diamond_{[0,5]} \exists n_2)$ and $n_1$, $n_2$ patterns
representing a sepsis diagnosis and drug administration, respectively. The query
searches for matches of $n_1$ in the RTM^H that satisfy $\kappa$, i.e., for patients that,


Keeping Pace with the History of Evolving Runtime Models 265 


[Figure 3. Left, medical log with the entries “sepsis diagnosis,ts=3,id=1” and “drug administration,ts=9,id=1”. Middle, G_3: shs:SHS (cts = 0, dts = ∞) and s:Sensor (cts = 3, dts = ∞, id = 1, status = sepsis). Right, G_9: shs:SHS (cts = 0, dts = ∞), s:Sensor (cts = 3, dts = ∞, id = 1, status = sepsis), and u:Pump (cts = 9, dts = ∞, id = 1, status = drug).]


Fig. 3: Exemplary Medical Log and Corresponding RTM^H G_3 and G_9


although diagnosed with sepsis, did not receive a drug within the designated 
time. In INTEMPO, each match is associated with a temporal validity, i.e., a set 
of time intervals for which, based on the overlap among the cts and dts of the 
matched elements and the interval for which ac is satisfied, the match is valid. 
ITQL also allows for the definition of OCL constraints [12] on sought patterns. 


Output The ITQL specification for the query $q_1$ is shown in Figure 4. Performing
RTM^H Analysis for the query $q_1$ on the RTM^H $G_9$ of Figure 3 returns one
match, since there is indeed no Pump attached to the SHS, i.e., a match for $n_2$,
within five time units after a Sensor was activated, i.e., a match for $n_1$ was
found. The temporal validity interval [3, 4] is returned together with the match.
The match, i.e., violation, is indeed valid only for that interval since after
timestamp 4, a match for $n_2$ starts to exist within five time units of a match
for $n_1$. If the API of INTEMPO is used, the query returns the match of the $n_1$
pattern, i.e., the EMF objects, together with the temporal validity. In case
INTEMPO is used via the UI, it displays a message box in ECLIPSE with the
following message: SHS@0[] Sensor@3[status=sepsis] [[3,4]]. Note that “@”
precedes the cts of an object and values within square brackets are attributes of
the object.

[Figure 4 listing: a declarations block defining the query $q_1$, followed by the patterns n1 {shs:SHS, s:Sensor, with an OCL constraint on the sensor status} and n2 {shs:SHS, s:Sensor, u:Pump, shs -ownedPumps-> u, shs -ownedSensors-> s, with an OCL constraint}.]


Fig. 4: Example query in ITQL
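
The endpoints of the temporal validity reported above can be recomputed by hand: the violation starts when the $n_1$ match comes into existence and ends once a $n_2$ match lies at most five time units ahead. The sketch below captures only this endpoint arithmetic for the example. The function name and its parameters are assumptions made for illustration; this is not the INTEMPO implementation, and it does not model whether interval bounds are open or closed, which is governed by the MTGL semantics in [10,15].

```python
import math


def violation_interval(n1_cts, n1_dts, n2_cts, bound):
    """Endpoints of the period during which the n1 match exists while no
    n2 match lies within `bound` time units ahead (hypothetical helper)."""
    start = n1_cts
    # From time n2_cts - bound onward, n2 is reachable within the bound.
    end = min(n1_dts, n2_cts - bound)
    return (start, end) if start <= end else None


# G_9: Sensor created at 3 (never deleted), Pump created at 9, bound 5.
print(violation_interval(3, math.inf, 9, 5))

# G_3: no Pump exists yet, modeled here as a creation time of infinity.
print(violation_interval(3, math.inf, math.inf, 5))
```

The two calls correspond to the validities [3,4] and [3,∞] discussed for $G_9$ and $G_3$.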


Technical Details For the execution of temporal graph queries, INTEMPO em- 
ploys the operationalization framework presented in [15]. The framework sup- 
ports the decomposition of a query into a suitable ordering of simpler sub-queries 
which is executed bottom-up. The outermost query computes the overall result. 
For pattern-matching, INTEMPO employs the Story Diagram Interpreter from [1] 
which uses heuristics shown to reduce the pattern-matching effort. INTEMPO 
provides an XTEXT [2] editor for ITQL which supports completion suggestions 
for element types and validation of the query syntax. 


3 LogAnalysis 


This section demonstrates the LogAnalysis operation mode, which assumes that
data from past states have been captured as events in a log. INTEMPO offers
the capability to process the system changes and, upon each change, obtain an
updated RTM^H which is then used internally to perform RTM^H Analysis.

Events-to-Patterns (E2P) Specification Language The mapping of log events
(which encapsulate system changes) to prescribed modifications on an RTM^H
is facilitated by E2P. An E2P specification consists of mappings between events
and actions that should be performed on an RTM^H. E2P supports five actions
(formulated as verbs): adds, to add a node and optionally assign values to the
added node’s attributes; adds-ref, to add an edge between two nodes; modifies,
to modify the attribute values of a node; deletes and deletes-ref, to delete a node,
respectively an edge, from the RTM^H. To accommodate linked data, E2P allows
for the indexing of added nodes so that later events can refer to modifications
that have been processed earlier. An example of an E2P mapping from an
exemplary log in Figure 3 (left) to the corresponding elements of the SHS is shown
in Figure 5. Note that edge types, e.g., ownedPumps, are not depicted in Figure 3.

[Figure 5 listing: init:{adds shs:SHS}; for the sepsis diagnosis event: adds s:Sensor [status="sepsis" id=*p2] and adds-ref shs -ownedSensors-> s; for the drug administration event: adds u:Pump [status="drug" id=*p2] and adds-ref shs -ownedPumps-> u.]

Fig. 5: E2P Example for SHS

As an example, the event drug administration from the medical log in
Figure 3 corresponds to the following changes to the (internal) RTM^H: a Pump is
created; its attribute status is set to “drug” and its attribute id takes the value
of the second field after the event name (expressed by the special *p token), i.e.,
the id field in the log of Figure 3. By default, the cts is set to the value of
the event field that is next to the event name, i.e., the ts field in Figure 3. The
init statement is used to initialize the RTM^H, and the cts of the nodes within it
is set to zero. To increase the readability of specifications, an explicit assignment
for the dts may be omitted: unless there is an attribute assignment, the dts of all
nodes is set to the maximum value supported.
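
The effect of such a mapping on the internal model can be imitated in a few lines. The sketch below is not E2P and does not use any INTEMPO API: the `Node` class, the `apply_event` function, and the assumption that each log line holds the plain values `event,ts,id` are all hypothetical and serve only to illustrate how the two events of Figure 3 yield the nodes and edges of $G_9$.

```python
import csv
import io
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    type: str
    cts: float                 # creation timestamp
    dts: float = math.inf      # deletion timestamp; infinity means "not deleted"
    attrs: dict = field(default_factory=dict)


def init_model():
    """Counterpart of the init statement: initial nodes get cts = 0."""
    return {"shs": Node("shs", "SHS", cts=0)}, []


def apply_event(nodes, edges, row):
    """Apply one log row (event name, ts, id) to the model."""
    event, ts, ident = row[0], int(row[1]), row[2]
    if event == "sepsis diagnosis":
        nodes["s"] = Node("s", "Sensor", cts=ts,
                          attrs={"status": "sepsis", "id": ident})
        edges.append(("shs", "ownedSensors", "s"))
    elif event == "drug administration":
        nodes["u"] = Node("u", "Pump", cts=ts,
                          attrs={"status": "drug", "id": ident})
        edges.append(("shs", "ownedPumps", "u"))


# Assumed log layout: event name, timestamp, id (comma-separated values).
log = "sepsis diagnosis,3,1\ndrug administration,9,1\n"
nodes, edges = init_model()
for row in csv.reader(io.StringIO(log)):
    apply_event(nodes, edges, row)

for node in nodes.values():
    print(node)
```

Running a query after each call to `apply_event` would correspond to the per-event analysis that LogAnalysis performs.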


Output LogAnalysis provides a view on the matches per event timestamp.
Performing LogAnalysis on the query $q_1$ and the log of Figure 3 would return:


@3 SHS@0[] Sensor@3[status=sepsis] [[3,∞]]
@9 SHS@0[] Sensor@3[status=sepsis] [[3,4]]


First, the sepsis diagnosis event is processed, which makes the internal RTM^H
identical to $G_3$ in the same figure. The query is executed using RTM^H Analysis
and returns a match, i.e., violation, since at that moment a match for $n_2$ does
not exist in the graph. The temporal validity is equal to [3,∞], i.e., the match
is valid from time point 3 onward. Next, the drug administration event is
processed, which leads to $G_9$. The result of RTM^H Analysis for $G_9$ is the same
as the result described in Section 2.


Technical Details In LogAnalysis the query execution framework monitors the
RTM^H for changes and, upon every change, recomputes the matches. Previous
matches are kept in-between executions and therefore the query is executed
incrementally. Similarly to ITQL, E2P is supported by an XTEXT editor that
offers syntax validation and completion suggestions for element types.


4 Conclusion and Future Work


We presented INTEMPO, an EMF tool which enables the specification and
incremental execution of temporal graph queries over a runtime model with history.
The latter can be either provided as input or obtained from an event log. INTEMPO
stands out from relevant tools owing to the integration of temporal requirements
into graph queries, the in-memory representation of the model, and the
systematic measures to contain memory consumption despite the accumulation of
temporal data. Moreover, INTEMPO offers input editors with features that aim
at helping the user, e.g., syntax validation. In the future, besides streamlining
INTEMPO, we plan to perform extensive evaluation and comparisons with other
tools. Moreover, we plan to explore the utilization of INTEMPO in self-adaptation
scenarios where the history of the system is required.


Abstract. Compilers are error-prone due to their high complexity. They
are relevant not only for general-purpose programming languages, but
also for many domain-specific languages. Bugs in compilers can potentially
put all compiled programs at risk. It is thus crucial that compilers are
systematically tested, if not verified. Recently, a number of efforts have
been made to formalise and standardise programming language semantics,
which can be applied to verify the correctness of the respective compilers.
In this work, we present a novel specification-based testing method
named SpecTest to better utilise these semantics for testing. By applying
an executable semantics as test oracle, SpecTest can discover deep semantic
errors in compilers. Compared to existing approaches, SpecTest is
built upon a novel test coverage criterion called semantic coverage which
brings together mutation testing and fuzzing to specifically target less
tested language features. We apply SpecTest to systematically test two
compilers, i.e., the Java compiler and the Solidity compiler. SpecTest
improves the semantic coverage of both compilers considerably and reveals
multiple previously unknown bugs.

Keywords: Mutation testing - Compiler testing - K framework - Formal 
semantics - Rare language features 


1 Introduction 


Compilers must be thoroughly tested (if not verified) for multiple reasons. First,
compilers are essential for the software ecosystem. Their correctness is a
prerequisite for program correctness. That is, a compiler bug might propagate to all
produced programs. Second, compilers are error-prone due to their high complexity.
Their main functionality is to convert source code to executable machine code.
They often provide additional features, like code optimisation or debug utilities.
A variety of compilers has been written for countless languages. Modern compilers
like GCC, javac, and LLVM are overwhelmingly complicated (e.g., GCC has
more than 7M lines of code and OpenJDK has more than 11M [20]). Although
some of them have been used for decades, they may still be buggy [54,55].
Recently, there have been numerous efforts on formalising and standardising
programming language semantics, such as K-Java [24], C semantics [29],
KJS [47], or KSolidity [34,44], which readily serve as a specification of the
respective compilers. Usually, these executable semantics are accompanied by manually
crafted unit tests. Such tests are, however, designed to test the semantics rather
than the compliance of the compiler with the language semantics. In this work, we
aim to better utilise these semantics by automatically generating test programs
with a novel coverage criterion that facilitates systematic compiler testing.
Multiple approaches have been recently proposed to test compilers. Most of 
them successfully found compiler bugs. For instance, the EMI project discovered 


© The Author(s) 2021 
E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 269-291, 2021. 
https://doi.org/10.1007/978-3-030-71500-7_14 


270 R. Schumi and J. Sun 


more than 1600 bugs in GCC and LLVM [53]. Another study has revealed bugs
in the Java compiler by comparing different javac and JVM versions [27]. For
the relatively new Solidity (smart contract) language, many crashes were found
through fuzzing [28]. Moreover, bugs in compilers may be exploited by attackers.
For example, prior to version 0.5.0, the Solidity compiler had an uninitialised
storage pointer vulnerability that affected many smart contracts on Ethereum.
A honey pot named OpenAddressLottery was designed to exploit this vulnerability
and steal ether (i.e., digital money in Ethereum). There are hundreds or even
thousands of programming languages according to different sources [30], and
many new ones emerge every year. For example, various new general-purpose or
domain-specific languages have been developed recently, such as Rust, Kotlin,
Solidity, and Move.


Compiler testing is an ongoing research field. Next, we briefly review existing 
approaches according to how they address the following two problems. 


1. The test generation problem: how are test cases (i.e., programs with specific 
inputs) selected and generated? 
2. The oracle problem: how are test results judged as passing or failing? 


Existing compiler testing approaches solve the test generation problem mainly 
in two ways: by generating programs according to a grammar that specifies 
the syntax of a language [49,31,23], or by mutating existing seed programs 
[055M1]. For the former, due to a huge search space, additional selection 
criteria must be applied to selectively generate test cases for compilers, such as 
standard code coverage criteria like statement coverage. For the latter, existing 
mutation strategies are often limited by the ‘weak’ oracles (as we will discuss 
shortly) employed by the approach, e.g., mutating to introduce ‘dead’ code. 
Generally, approaches which generate complicated syntax focus more on parsing 
errors instead of errors in the semantics. For the oracle problem, existing propos- 
als mainly have three oracles. The first oracle is one that only flags a test failure 
if the program is incompilable or leads to crashes [28]. The second oracle flags 
a test failure if certain algebraic properties are violated. For instance, the alge- 
braic property adopted in the EMI approach is that mutating unreachable 
code does not change the execution result. We remark that these two oracles are 
‘weak’ as they are unable to detect simple semantic errors such as 3+4 = 8. The 
third, stronger oracle is one that checks whether the output of a test program is 
consistent with a reference, which could be a second compiler (i.e., differential 
testing), or an abstract specification like a state machine [35,36]. This oracle 
requires a reference, which is not always feasible. Furthermore, it is limited to 
bugs which result in inconsistencies between the compiled program and the ref- 
erence. Last but not least, existing approaches do not provide a good adequacy 
measurement on the progress of compiler testing. Often measurements, like code 
coverage, are used as an indicator, but they have the limitation that they need 
access to the compiler code, and achieving full code coverage is challenging. 


In this work, we present a novel specification-based testing method called 
SpecTest for compiler testing. SpecTest differs from existing approaches in the 
following aspects. First, SpecTest is built upon a strong oracle, i.e., an executable 
language specification that can predict the expected output of test programs. 
This strong oracle enables us to detect semantic errors, i.e., bugs that are related 
to the semantics. Such bugs may also originate from the runtime environment. 
Hence, SpecTest is not just limited to classical compiler bugs. Second, SpecTest 
offers a testing adequacy measurement in terms of semantic coverage and has a 
built-in mutation-based test case generation method which aims to achieve high 
semantic coverage. The semantic coverage measures the number of language 
semantic rules that are covered by existing test cases. The test case generation 
method mutates the seed programs accordingly to maximise the coverage of the 
language semantics, e.g., by introducing less-tested language features into these 
programs. Compared to measuring the code coverage of a compiler, our semantic 
coverage has the added value that it does not need access to the compiler code, 
and it specifically targets semantic bugs. 

Given a language semantics (in the form of a set of small-step operational 
semantic rules), SpecTest executes fully automatically. We have implemented 
SpecTest for two compilers, i.e., the Java compiler and the Solidity compiler, and 
tested the language features that are supported by our applied semantics [24,44]. 
The results of the evaluation were promising. SpecTest successfully increased the 
semantic coverage for both compilers, and identified many bugs and issues that 
helped the compiler and specification developers. 

To sum up, we make the following technical contributions. 

— We propose a semantic coverage criterion for measuring the adequacy of 
compiler testing. 

— We introduce a novel compiler testing method that uses an executable lan- 
guage specification as an oracle. 

— We demonstrate the applicability and generality of SpecTest by applying it 
to two compilers. 

The paper is structured as follows. Sect. 2 explains our method and discusses 
the required components in detail. In Sect. 3, we present our evaluation with 
two compilers. Next, we review related work in Sect. 4 and then conclude. 


2 Method 


In this section, we outline how SpecTest works. In particular, we present its 
high-level design, highlight relevant details of its components, and explain the 
workflow step by step using an example. 


2.1 Overall Design 


The overall workflow of SpecTest is depicted in Fig. 1. In the following, we 
introduce the tasks briefly before diving into the details of the main components. 

(1) A set of user-provided seed programs are given as input to a program 
fuzzer one by one, which generates a set of test inputs for each program with 
the intention to cover as many program paths as possible. A program and the 
associated test inputs form a test case that is the basis for the next phase, the 
test execution and evaluation. (2) The program is compiled with the compiler 
[...] semantic (SOS) rules. The final state is obtained as the semantic 
execution result. During the semantic execution, we 
monitor how frequently each SOS rule is fired in order to identify rarely fired 
rules. (4) The results of the program and semantic execution are compared in 
order to assess whether the program (built by a compiler) produces an output 
which is consistent with the language semantics. If the results are inconsistent, 
the test case is flagged as a failure. The failure may be either due to a bug in 
the compiler (or the execution environment of the program, e.g., JVM) or in 
the language semantics. (5) We rank the SOS rules according to the number of 
times they are fired and identify the ones which are least fired. Each SOS rule is 
typically associated with one language feature and thus we are able to system- 
atically identify language features which are least tested. With the information, 
a program mutator mutates the seed programs so that the corresponding lan- 
guage features are introduced systematically into the programs. In contrast to 
classical mutation testing [33], which ensures the quality of test suites, we apply 
mutations to generate more and better test cases. (6) We then repeat from step 
(1), and the process continues until a user-specified timeout is triggered. The 
output of SpecTest includes a set of passed/failed test cases as well as a report 
on the semantic coverage, i.e., the number of times each SOS rule is fired. 
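
The steps above can be sketched as a single testing round. This is only a minimal illustration, not the SpecTest implementation: the callables `gen_inputs`, `run_compiled` and `run_semantics` are hypothetical stand-ins for the fuzzer, the compiled program and the executable semantics, and program states are modelled as plain dictionaries.

```python
from collections import Counter


def run_round(seeds, gen_inputs, run_compiled, run_semantics):
    """Run steps (1)-(5) once over all seed programs.

    All three callables are hypothetical stand-ins, not SpecTest APIs.
    """
    fired = Counter()   # how often each SOS rule fired
    failures = []       # test cases with inconsistent results
    for program in seeds:
        for inputs in gen_inputs(program):                    # (1) fuzzing
            actual = run_compiled(program, inputs)            # (2) compiled run
            expected, rules = run_semantics(program, inputs)  # (3) semantic run
            fired.update(rules)
            if actual != expected:                            # (4) comparison
                failures.append((program, inputs))
    # (5) rank the fired rules, least fired first; rules that never fired
    # would additionally require the full rule set of the specification.
    ranking = sorted(fired, key=lambda rule: (fired[rule], rule))
    return failures, ranking


# Toy instantiation: a "compiler" that miscomputes 3 + 4.
def toy_inputs(program):
    return [(1, 2), (3, 4)]


def toy_semantics(program, inputs):
    return {"result": inputs[0] + inputs[1]}, ["add-int"]


def toy_compiled(program, inputs):
    wrong = 8 if inputs == (3, 4) else inputs[0] + inputs[1]
    return {"result": wrong}


failures, ranking = run_round(["add"], toy_inputs, toy_compiled, toy_semantics)
print(failures)
```

In this toy setting, only the input pair on which the two executions disagree is flagged, which mirrors the role of the strong oracle in step (4).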

Note that there are three main components in SpecTest, i.e., 
the executable program semantics which serves as the oracle, the program fuzzer, and 
the program mutator. We present details of these components in the following. 


Fig. 1: Overview of the data flow of SpecTest 


2.2 The Oracle 


The oracle is an executable semantics of the programming language. That is, 
the oracle encodes the language semantics in the form of small-step SOS rules. 
Given a program (and necessary inputs for the program), the oracle is capable 
of executing the program according to the language semantics to produce the 
expected output, without going through the compiler to be tested. 

Creating an executable semantics for a programming language is not trivial. 
It requires experience as well as effort. Nonetheless, it is desirable to have one 
rule I1 + I2 => I1 +Int I2 
rule if (true) S else _ => S 


rule [Allocate-Global-NonArrayType]: 
  <k> #allocate(N, CN, #varInfo(X:Id, E:Value, T:NonArrayType, #storage, L)) => . ... </k> 
  <account> 
    <acctID> N </acctID> <contractName> CN </contractName> 
    <acctEnv> CONTEXT:Map => CONTEXT[X <- #storedVar(Slot +Int 1, T, #storage, L)] </acctEnv> 
    <acctStorage> STORAGE:Map => STORAGE[Slot +Int 1 <- E] </acctStorage> 
    <acctSlots> Slot => Slot +Int 1 </acctSlots> ... 
  </account> 


Fig. 2: Example SOS rules for Solidity [44] 


because it provides a reliable way to check the correctness of compilers, and 
it will save time and effort in the long term since it effectively reveals ambi- 
guities, inconsistencies and incompleteness. Many researchers have realised the 
importance of executable language semantics and have built foundations that 
we can work with, like the K framework [50], Redex [37], or Ott [51]. There are 
already executable semantics for many programming languages, like C, Java, 
JavaScript, or Solidity, which represent a strong oracle for compiler testing. 

It is conceivable and in fact confirmed by our experiments that the oracle it- 
self can be buggy due to human errors in encoding the language semantics or due 
to ambiguity in the language semantics in the first place. However, even a po- 
tentially buggy executable semantics is much better than none for the following 
reasons. First, during the above-mentioned process, SpecTest is able to identify 
bugs in the oracle, which helps to improve the language semantics. Second, bugs 
in the semantics are overall less likely compared to compiler bugs since the com- 
piler must not only implement the semantics but also handle sophisticated code 
optimisations, which are known to be error-prone. 

In this work, we apply the K framework [50] as a basis for our oracle. The K 
framework provides convenient notations for defining language semantics or type 
systems based on rewriting rules, configurations, and computations. It comes 
with a range of supporting tools, like a parser, an interpreter, or a program 
verifier, which enable the execution of the specifications. In short, it combines 
the functionality of both the compiler and the runtime environment. Encoding 
small-step SOS rules in the K framework is relatively straightforward. For example, 
Fig. 2 shows three (simplified) rules defined for Solidity (i.e., a language 
for programming smart contracts) programs. In particular, the first rule shows 
how simple addition should behave for Integers, given the existing K construct 
for addition +Int. The second example is a rule for an if conditional statement, 
where the condition is true and the result is the then-branch. Not all rules are 
simple though. The third example is a rule for the storage allocation of a global 
non-array variable. In general, the rules become more complex for sophisticated 
language features such as concurrency or higher order functions. 
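
To make the notion of firing small-step rules concrete, the following toy rewriter mimics the first two rules of Fig. 2 and counts how often each one fires. It is a sketch under our own assumptions and is unrelated to the K tooling: terms are encoded as Python tuples, the rule names `add-int`, `if-true` and `if-false` are invented labels, and the `if-false` rule is added only for symmetry.

```python
from collections import Counter


def is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def step(term, fired):
    """Apply one small-step rule; return the new term or None if stuck/final."""
    if not isinstance(term, tuple):
        return None  # values do not reduce any further
    tag = term[0]
    if tag == "add":
        _, left, right = term
        if is_int(left) and is_int(right):
            fired["add-int"] += 1        # rule I1 + I2 => I1 +Int I2
            return left + right
        if not is_int(left):
            reduced = step(left, fired)
            return None if reduced is None else ("add", reduced, right)
        reduced = step(right, fired)
        return None if reduced is None else ("add", left, reduced)
    if tag == "if":
        _, cond, then_branch, else_branch = term
        if cond is True:
            fired["if-true"] += 1        # rule if (true) S else _ => S
            return then_branch
        if cond is False:
            fired["if-false"] += 1       # assumed counterpart, not in Fig. 2
            return else_branch
        reduced = step(cond, fired)
        return None if reduced is None else ("if", reduced, then_branch, else_branch)
    return None


def run(term):
    """Rewrite until no rule applies; return the final term and rule counts."""
    fired = Counter()
    while True:
        nxt = step(term, fired)
        if nxt is None:
            return term, fired
        term = nxt


value, fired = run(("if", True, ("add", 1, ("add", 2, 3)), 0))
print(value, dict(fired))
```

The rule counter returned by `run` is the kind of information on which a semantic coverage measurement can be based: a rule with a count of zero, such as `if-false` here, was never exercised by the test program.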

In this work, we adopt and extend the K semantics for Java [24] and Solidity 
[34,44] to implement SpecTest. The K semantics for Solidity, called KSolidity, 
currently has 304 rules. The K semantics for Java, called K-Java, has 1385 rules. 
K-Java was developed for an earlier version of Java (1.4) and some rules are deprecated 
or unreachable. Our extension to these existing efforts concerned mainly 
two aspects, i.e., extending them with proper interfaces and conversions so that 
they work with other components in SpecTest; and introducing a measurement 
feature for semantic coverage. For example, we enhanced the coverage engine of 
the old K version for K-Java, and we added a visualisation of the covered rules. 

Given a test case (in the form of a program with inputs), the executable 
semantics is used as follows. First, the test case is executed using the built-in 
execution engine of the K framework which fires the SOS rules one by one. The 
final variable valuations are captured as the result of the test case. For instance, 
for Solidity, we capture all the persistent states in the blockchain network (which 
includes addresses, their balances and the values of storage variables). This test- 
ing result is turned into an assertion in the test case. The test case with the 
assertion is then executed using the compiled program. If the assertion fails 
(e.g., the value of at least one variable is different), a bug is revealed. 
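
The comparison of the predicted and the observed final state can be sketched as follows. The dictionary-based state representation and the variable names are assumptions made for illustration; the actual captured state (e.g., addresses, balances and storage variables for Solidity) is richer.

```python
def state_diff(expected, actual):
    """Return the variables whose values differ between oracle and program.

    Maps each differing name to (expected value, actual value); a variable
    missing on one side is reported with None for that side.
    """
    diff = {}
    for name in sorted(set(expected) | set(actual)):
        exp = expected.get(name)
        act = actual.get(name)
        if exp != act:
            diff[name] = (exp, act)
    return diff


# Hypothetical final states of one test case.
oracle_state = {"owner": "0xA", "balance": 10, "counter": 6}
program_state = {"owner": "0xA", "balance": 10, "counter": 9}

mismatches = state_diff(oracle_state, program_state)
print(mismatches)
```

An empty result corresponds to a passing assertion; any entry identifies a variable for which the compiled program deviates from the semantics.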

Simply applying the above-mentioned steps to test compilers would not be 
comprehensive. That is, existing seed programs often use a limited set of common 
language features and thus would not be able to test the compiler extensively. 
In fact, our experience on testing the Solidity compiler with existing smart con- 
tracts suggests that many smart contracts are suspiciously similar. As a result, 
the test cases would only exercise a limited set of semantic rules and thus would 
miss those bugs in the part of the Solidity compiler that encodes the remaining 
semantic rules. While collecting a large set of seed programs would likely be 
helpful, the larger problem at stake is whether there could be a certain quantitative 
measurement on the comprehensiveness of the test cases and whether we 
can use the measurement to guide the generation of new test cases. SpecTest’s 
answer to this question lies in the design of the mutator and the fuzzer. 


2.3 The Mutator 


Due to the high complexity of modern compilers, it is important that a meaning- 
ful coverage criterion is applied for compiler testing. Existing approaches either 
are not concerned with coverage or they use coverage criteria which are not ideal 
for compiler testing. Hence, we introduce our novel semantic coverage. 


Definition 1. Given that $R$ is the set of all semantic rules of our specification, $T$ is 
the set of our given test programs, $I_t$ is the set of all possible inputs for the test 
program $t \in T$, and $\mathit{cover}(t,i,r)$ is a predicate that is true when executing the 
test program $t$ with the test input $i \in I_t$ fires the semantic rule $r$ of our 
specification, our semantic coverage can be defined as follows: 
$$\forall r \in R : \exists t \in T : \exists i \in I_t : \mathit{cover}(t,i,r)$$ 

This means that to achieve semantic coverage (or at least increase it), it is not 
only important that we have good test programs, but also the test inputs for 
these programs are essential. In order to produce good test programs, we apply 
our mutations that inject language features to specifically target the uncovered 
rules as we will explain in detail in the following. The coverage of all rules $r \in R$ 
would give us full semantic coverage, but in reality this is often infeasible, hence 
we also depict it as the percentage of rules that are covered. 
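
The percentage view of semantic coverage can be computed directly from the rule firing counts. The sketch below assumes that the set of all rules and a mapping from rule names to firing counts are available; the rule names are invented for illustration.

```python
def semantic_coverage(all_rules, fired_counts):
    """Return (fraction of covered rules, sorted list of uncovered rules)."""
    uncovered = sorted(r for r in all_rules if fired_counts.get(r, 0) == 0)
    covered = len(all_rules) - len(uncovered)
    return covered / len(all_rules), uncovered


# Hypothetical rule set and firing counts collected over all test cases.
rules = {"add-int", "if-true", "if-false", "alloc-global"}
counts = {"add-int": 5, "if-true": 1}

ratio, missing = semantic_coverage(rules, counts)
print(f"{ratio:.0%} covered, uncovered: {missing}")
```

Full semantic coverage in the sense of Definition 1 corresponds to a ratio of 1.0, i.e., an empty list of uncovered rules.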
In SpecTest, we achieve high semantic coverage with the following two synergistic 
parts. First, we design and implement a mutator which systematically 
introduces less-exercised language features into the test programs automatically. 
Second, we design and apply powerful fuzzing techniques to generate program 
inputs to exercise all statements including the less-used features in the test pro- 
grams. The latter can be achieved with fuzzers optimised for existing code cov- 
erage criteria such as branch or statement coverage. 

We believe that a comprehensive test suite for a compiler must cover all 
relevant aspects of the language semantics, and semantic coverage offers such a 
measurement. The above definition simply measures whether a rule is fired or 
not. It might be meaningful to further measure the context in which each SOS 
rule is fired (as certain bugs might only be triggered when a rule is fired in a 
certain context), which we leave as future work. 


To achieve high semantic coverage, SpecTest employs a two-part solution. 
Given the oracle’s feedback on which SOS-rules are not fired (or least fired), the 
language features which are associated with the SOS rules are identified. This is 
straightforward as each SOS rule is associated with a specific language construct. 
For instance, when the first rule of Fig. 2 is not fired, then this would highlight 
that our test programs contain no addition between Integer variables. Next, the 
mutator takes the information and systematically mutates the seed programs to 
introduce these less-tested language constructs. 
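
Selecting mutation targets from the oracle's feedback can be sketched as a ranking over rule firing counts followed by a lookup of the associated language feature. The rule-to-feature table below is hypothetical; in practice it would be derived from the language constructs that the SOS rules refer to.

```python
def mutation_targets(all_rules, fired_counts, rule_to_feature, limit):
    """Return up to `limit` language features, least-fired rules first."""
    ranked = sorted(all_rules, key=lambda r: (fired_counts.get(r, 0), r))
    targets = []
    for rule in ranked:
        feature = rule_to_feature.get(rule)
        if feature is not None and feature not in targets:
            targets.append(feature)
        if len(targets) == limit:
            break
    return targets


# Hypothetical rules, counts and rule-to-feature association.
rules = ["add-int", "modifier-call", "labelled-break", "if-true"]
counts = {"add-int": 40, "if-true": 12, "modifier-call": 1}
features = {
    "add-int": "integer addition",
    "modifier-call": "function modifiers",
    "labelled-break": "labelled blocks",
    "if-true": "if statements",
}

targets = mutation_targets(rules, counts, features, limit=2)
print(targets)
```

Ties are broken alphabetically by rule name here, which is an arbitrary choice; any deterministic tie-breaker would serve the same purpose.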


The mutator is a code mutation engine which is designed to automatically 
mutate a given source program to generate new programs (i.e., test cases for the 
compiler). Existing mutation approaches [38,41,55] for compiler testing already 
applied mutators to generate test programs, but they mutate based on simple 
algebraic rules and are not systematic. For instance, equivalence modulo inputs 
(EMI) [41] works by injecting code into seed programs with the aim to achieve 
a high difference in the control- and data-flow compared to the original seed 
program in order to produce diverse test programs. In comparison, our mutator 
is designed to maximise semantic coverage. 


Implementing the mutator is not trivial. For SpecTest, the mutators for So- 
lidity and Java were implemented based on existing parsers through code instru- 
mentation. That is, given a language feature and a source program, the mutator 
first parses the source program to build an AST. Afterwards, it identifies poten- 
tial locations in the AST for introducing the features. Lastly, it systematically 
applies a mutation strategy specifically designed for the language constructs to 
inject them at all possible or specific pre-defined locations. In the following, we 
introduce three mutation strategies as examples. 


We investigated features that were specific for Solidity. For example, one mutation 
introduces modifiers for functions, which define conditions that must hold 
when a function is executed. Listing 1.1 shows a smart contract with modifiers 
written in the Solidity language. Unlike traditional programs, smart contracts 
cannot be modified once they are deployed on the blockchain. As a result, their 
correctness is crucial. So is the correctness of the compiler since the compiled 
programs are deployed on the blockchain. Furthermore, the Solidity compiler 
 1 contract AccessRestriction { 
 2   address public owner = msg.sender; 
 3   //default modifier: 
 4   modifier onlyBy(address account){ 
 5     require(msg.sender==account, "Sender not authorized"); 
 6     _; } //injected modifier: 
 7   modifier cgskst(address value){ 
 8     require(value == address(0x0),""); 
 9     _; } //injected modifier: 
10   modifier cbhsmo(address value){ 
11     require(value == address(0x0),""); 
12     _; } //injected modifier: 
13   modifier nlwxmv(address value){ 
14     require(value == address(0x0),""); 
15     _; 
16   }//Make newOwner the contract owner: 
17   function changeOwner(address newOwner) public onlyBy(owner) 
         cgskst(address(0x0)) cbhsmo(address(0x0)) nlwxmv(address(0x0)){ 
18     owner = newOwner; 
19 }} 

Listing 1.1: Simple modifier example 


bibt4QkDIfJ: { 
bsJxhbtSJBu: { 
bHhq230wDjZ: { try { 
bEdqZ33tKi9: { 
bVm9tCxbul4: { 
  if (i >= 5){ break; } 
  break bEdqZ33tKi9; 
} } 
}catch(RuntimeException e){ 
bQ2yucCPLQr: { 
  System.out.print("X"); 
  break bQ2yucCPLQr; 
} } } } } 

Listing 1.2: Labelled block mutation 


contract Test { 
  function testFunc(int a) 
      public pure returns (int) { 
    int result = a + a++; 
    //produces 3 when a is 1 
    return result; 
}} 

Listing 1.3: Simple contract example 


has been under rapid development and there are unique language features with 
sometimes confusing semantics. Thus, it is a good target for evaluating the effectiveness 
of SpecTest. In this example, the modifier onlyBy ensures that the 
function changeOwner can only be called when the address of the contract owner is 
used. By integrating various dummy modifiers (Lines 7, 10 & 13) into our seed 
contracts and by adding them to functions (Line 17), we noticed that an older 
version of the Solidity compiler crashed in some cases, when more than a certain 
number of modifiers are used. Such a case is difficult to find with normal tests, 
since it is rare to use multiple modifiers for a function. Given that a less-fired 
SOS rule is concerned with the modifier construct in Solidity, to introduce mod- 
ifiers, the mutator scans through the AST for function declarations. For each 
function declaration, the mutator randomly adds one or more modifiers. 
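
The modifier mutation can be illustrated with a much simplified, text-based sketch. Note the assumptions: instead of traversing an AST as described above, it uses a regular expression that only recognises `public` functions with simple parameter lists, and the helper name `inject_modifiers` is our own. The injected modifier bodies follow the pattern of Listing 1.1.

```python
import random
import re
import string

# Matches "function name(params) public"; a deliberate simplification.
FUNC_HEADER = re.compile(r"(function\s+\w+\s*\([^)]*\)\s*public)")


def inject_modifiers(source, count, rng):
    """Declare `count` dummy modifiers and attach them to public functions."""
    names = []
    while len(names) < count:
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        if name not in names:
            names.append(name)
    decls = "".join(
        f'  modifier {n}(address value){{ require(value == address(0x0),""); _; }}\n'
        for n in names
    )
    uses = " ".join(f"{n}(address(0x0))" for n in names)
    # Split at the contract's opening brace and add the declarations there.
    head, brace, body = source.partition("{")
    body = FUNC_HEADER.sub(lambda m: f"{m.group(1)} {uses}", body)
    return f"{head}{brace}\n{decls}{body}"


seed = (
    "contract C {\n"
    "  address public owner = msg.sender;\n"
    "  function changeOwner(address newOwner) public {\n"
    "    owner = newOwner;\n"
    "  }\n"
    "}\n"
)
mutated = inject_modifiers(seed, 3, random.Random(0))
print(mutated)
```

A parser-based implementation avoids the limitations of the regular expression (e.g., other visibilities or nested parentheses in parameter lists), which is why an AST is the more robust basis for such mutations.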

We also introduced specific mutations for Java. For example, our experiments 
showed that semantic rules associated with labels were not fired. Hence, we in- 
troduced mutations that target these rules, e.g., a mutation that injects labelled 
blocks, which is a special and rarely used feature that allows an immediate exit of 
a block with a break statement. This mutation is illustrated in Listing 1.2, where 
we injected labelled blocks and breaks (with these labels) into a seed program. 

Both for Solidity and Java, we noticed that there are various rules in the K 
specifications (i.e., 11 rules for Java and 17 for Solidity) concerning mathematical 
expressions that were not covered, e.g., computations with hex-values. In order 
to cover these rules and to cover unusual usages in different contexts, we relied 
on a random approach in contrast to the other mutations where we injected code 
at specific places. We developed mutations that produce a variety of mathemat- 
ical expressions combining various language features, like operations containing 
variables with different data types, hexadecimal, octal or binary literals, pre- 
and postfix increment/decrement (++/--), bitwise and bitshift operators, various 
combinations of unary operators and arrays. A simplified example of a mutation 
produced with this strategy is shown in Listing 1.3. It can be seen that the 
increment operator (++) is used in an unusual context within a mathematical 
expression. Our experiments showed that the computation produced unexpected 
results, i.e., we found an issue with the computation order that caused the in- 
crement to be executed first, although it should be executed last [19]. 
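
A random generator for such expressions can be sketched as a small recursive function. The operator list, the operand kinds and the function names are illustrative assumptions and cover only a subset of the features listed above; the output is an expression string that could be embedded into a seed program.

```python
import random

BINARY_OPS = ["+", "-", "*", "&", "|", "^", "<<", ">>"]


def random_operand(rng, variables):
    """Pick a variable, a decimal or hex literal, or an increment/decrement."""
    kind = rng.choice(["var", "dec", "hex", "post_inc", "pre_dec"])
    if kind == "var":
        return rng.choice(variables)
    if kind == "dec":
        return str(rng.randint(0, 255))
    if kind == "hex":
        return hex(rng.randint(0, 255))
    if kind == "post_inc":
        return rng.choice(variables) + "++"
    return "--" + rng.choice(variables)


def random_expression(rng, depth, variables):
    """Build a fully parenthesised binary expression tree of the given depth."""
    if depth == 0:
        return random_operand(rng, variables)
    left = random_expression(rng, depth - 1, variables)
    right = random_expression(rng, depth - 1, variables)
    return f"({left} {rng.choice(BINARY_OPS)} {right})"


expr = random_expression(random.Random(1), 3, ["a", "b"])
print(expr)
```

Passing a seeded `random.Random` instance keeps the generation reproducible, which helps when a generated expression later needs to be traced back after a failing test.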


2.4 The Fuzzer 


By injecting specific language features into the seed programs, the mutator in- 
creases the likelihood of firing uncovered or poorly covered SOS rules during the 
test execution. The fuzzer is a fuzzing engine which generates test inputs for 
a given program. The generation is based on optimization (e.g., using genetic 
algorithms). One of the required inputs for the fuzzer is a set of seed source 
programs. Such source programs are often abundant. For instance, there are 
thousands of Solidity programs (contracts) on EtherScan.io. The fuzzer takes 
these contracts as input and generates test inputs for each contract. During this 
process, the fuzzer sets up a test blockchain network, deploys the contracts, and 
generates a sequence of transactions which invoke functions. 

For Solidity, we applied an existing smart contract fuzzer called sFuzz [46] 
that works with a new adaptive fuzzing technique for maximising the branch 
coverage. sFuzz uses an optimised version of a technique called American Fuzzy 
Lop (AFL) [59], for producing inputs that can achieve a high branch cover- 
age. It includes various test oracles for the detection of general vulnerabilities, 
like Integer overflows, or smart contract specific vulnerabilities, like a gasless 
send [48]. We applied sFuzz to maximise the coverage of our test programs to 
cover our injected features. For our injected features, the coverage was usually 
easily achieved. However, for other cases or to minimise the test inputs, it might 
be necessary to customise the fuzzer to specifically target newly added language 
features. For example, during the mutation, we can record which parts of the 
contracts have been mutated and prioritise those parts during fuzzing. For Java, 
we did not apply a fuzzer, because the majority of our seed programs were simple 
in nature. A single run produced full coverage in almost all cases. 


3 Evaluation 


We have implemented SpecTest for two compilers, a compiler for a general pur- 
pose language (Java) and one for a new domain-specific language (Solidity). 
In the following, we design multiple experiments to systematically answer the 
following research questions (RQ). 

— RQ1: How effective is our proposed method in finding bugs or inconsisten- 
cies? This is important since the primary aim of SpecTest is to provide a 
systematic way of generating a test suite for identifying compiler bugs. 

— RQ2: What kind of bugs and inconsistencies can be found? To further mo- 
tivate the usage of SpecTest, it makes sense to point out what issues can 
be found. In particular, we would like to check whether indeed there are 
compiler bugs associated with less-fired SOS rules. 
— RQ3: To what extent can the coverage of rules within the language specification 
be increased with specific mutations? The semantic rule coverage is one 
of the core aspects of SpecTest for finding bugs. Therefore, it is important 
to investigate to which extent we can increase this coverage. 

— RQ4: How much effort is it to apply SpecTest? When a tester is considering 
a testing method, the effort usually plays a big role. To create a good basis 
for a decision, we discuss the effort of applying SpecTest to two compilers. 


3.1 Test Setting 


As seed programs, we used existing test cases of K-Java [24] and KSolidity 
[34,44]. KSolidity is still under development, which means that we could not test 
all features or a large set of contracts, but it was already sufficiently developed 
to support many interesting cases. K-Java supports most features of Java 8, 
but it also has limitations, i.e., it was implemented in an old version of the K 
framework, which did not focus on performance. Hence, we used seed programs 
without imports of libraries. We do not regard this as a limitation since small 
programs have advantages, e.g., they are easier to debug and it reduces the time 
for test case minimisation. Moreover, it is well-known that many bugs can be 
revealed by small test cases [32], which are also common in traditional testing. 

For Solidity, we had only 37 seed programs, which were part of the KSolidity project, 
due to its early stage. Hence, it makes sense to apply SpecTest since it enables the 
generation of more test programs in a systematic way. Our mutator for Solidity 
is written with about 5,300 lines of Java code. In each test run, we applied one 
of our mutations (or in some cases also combinations) to the seed programs. We 
applied sFuzz to the mutated contracts and then converted the resulting test 
cases into a form usable by KSolidity. We primarily tested the Solidity compiler 
version 0.5.13, but initially also older versions. In some cases, we had to apply 
Truffle tests [21] (v5.1.10) and for debugging we used Remix [18], which facilitates 
a step-by-step exploration of the contract bytecode. 

For Java, we applied 756 seed programs and our mutator has about 6,100 lines 
of code. The mutations were similar to those explained before. In contrast to Solidity, 
we did not need a sophisticated fuzzer since the mutated Java programs were 
covered easily. Our focus was Java 13 (openjdk 13, 2019-09-17, RE build 13+33- 
Ubuntu-1), but we also tested older versions (11 and 8). For the mutator, we 
applied JavaParser 3.14.3 for parsing the programs and for injecting mutations. 

The experiments for Solidity were performed on a Dell X1 Carbon with an 
Intel i7-8565U CPU with four 1.80GHz cores and 16 GB RAM, for Java on a 
PC with an Intel i7-7700 CPU with four 3.60GHz cores and 64 GB RAM. 


3.2 Experiment Result 


We ran more than 30,000 test cases for Java, which had a total execution time 
of about three weeks. For Solidity, we ran more than 50,000 test cases with a 
total execution time of about two weeks. Details about the distribution of the 
run time will follow below. The execution times are not exact numbers, since the 
experiments sometimes were stuck due to out-of-memory exceptions, not enough 
space, etc. Unfortunately, we could not fully resolve such issues, because many 
mutations inject features with random aspects into the diverse seed programs. 
This caused various unpredictable situations, like endless loops or too large data 
structures. By adapting our mutator, we greatly reduced the number of such 
situations, but we could not remove all rare cases. 

RQ1: How effective is our proposed method in finding bugs or inconsistencies? 
We discovered issues and bugs both for Solidity and Java. Some of these issues 
were not found within the compiler or the runtime environment, but within 
the language semantics. Fixing such issues is also essential, since improving the 
specification is an important aspect of testing. 

In total, we found six issues for the Solidity compiler [1,9,10], two were related 
to error/warning messages [7,13], and three of the other issues might have the 
same cause, i.e., the execution order. For KSolidity, we found eight issues, six 
of them were related to unimplemented features. For Java, we found four issues 
with the compiler [2,5], two of which were concerned with error messages [6,12], 
and we discovered 13 issues with K-Java (eight issues or bugs, one 
warning-related issue, and four minor issues, like a wrong output 
representation [16]). More details about the different types of issue follow below. 

Our experiments showed that SpecTest is able to reveal issues, inconsistencies 
and bugs. These issues were not only found in the compiler, but also in language 
semantics (which are developed independently by other groups with dedicated 
effort). One might argue that finding bugs/issues in the language semantics is not 
as meaningful as finding bugs in the compiler. We believe that it is also crucial 
to ensure the robustness of the semantics since in general the quality of the tests 
or specification are essential for the overall robustness of software. SpecTest 
was able to find various inconsistencies and bugs in the specifications, which 
is important for the specification developers, as well as issues in the compilers. 
We have spent effort on confirming our findings and out of the 31 issues, we 
submitted 19 to the corresponding git repositories and reported the other issues 
to the developers or to a bug reporting system. For 13 issues, we received a 
confirmation or the developers mentioned that they will investigate and fix them. 

An aspect that might have limited the effectiveness, is that we did not fully 
apply our method for Java, since we only tested simple seed programs and did not 
use fuzzing. We believe that the issues we found still showed that our method 
was reasonably effective, even though we only partially applied it. Using the 
full extent of SpecTest for Java might require a more powerful specification, 
which is a potential topic for future work. Moreover, it should be mentioned that 
KSolidity is still being developed and not as stable as the Solidity compiler (or 
runtime environment), since much more effort was invested into the latter's development. 
This is similar for K-Java, and Java in general is robust due to its maturity. 

RQ2: What kind of bugs and inconsistencies can be found? We categorise 
our findings into three categories as illustrated in Table 1, i.e., (1) normal issues, 
bugs and missing features, (2) issues related to warning or error messages, and 
(3) minor inconsistencies or issues, like a small discrepancy in the output, e.g., 
-0e+00.0 instead of -0.0 [16]. Additionally, we differentiate whether the origin of an 
issue was the compiler or the specification, as illustrated by the columns of Table 1. 

Table 1: Found semantics and compiler issues 

                                          Solidity  KSolidity  Java  K-Java 
Normal issues/bugs                            4         8        2      8 
Warning or error message related issues       2         -        2      1 
Minor issues                                  -         -        -      4 
Total                                         6         8        4     13 

The most interesting issues that we found were the ones concerning the wrong
computation order in Solidity. These issues were caused by actual semantic
errors within the compiler. Moreover, we also found various issues with error or
warning messages. Such issues might seem trivial, but it is important to fix them
since meaningless error messages can cause a huge waste of debugging effort. The
bugs we found in the specifications had multiple sources, like the syntax parser,
wrong semantic rules, partially implemented rules, or rules applied in a wrong
context. Although K-Java and KSolidity already had many manual tests, we
showed that SpecTest was able to discover many inconsistencies and bugs. In
the following, we present example issues from the mentioned categories.

Solidity Findings. One of the issues [19] that SpecTest identified was that
wrong results were produced when we tested expressions with different assignment
operators. The behaviour can be observed in the following example, where the
increment operator is applied first, but should be applied last.


int a = 2; a *= 1 + a++; //results in 9 but should be 6 


A potential cause might be a wrong computation order. This issue was found
because some SOS rules for assignment operators were not covered by the existing
tests. By creating mutations that target these rules, we could generate expressions
like the one in the example, which led to the discovery of the issue since the
oracle predicted a different result.
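
The expected value of 6 follows from a strict left-to-right evaluation, whereas 9 results if the increment takes effect before the left operand is read. The following Java sketch spells out both orders by hand; the class and method names are illustrative assumptions and do not come from SpecTest or the Solidity compiler.

```java
// Illustrative only: models two evaluation orders for `a *= 1 + a++`.
final class EvalOrderDemo {

    // Left-to-right order: the left operand of *= is read before a++ runs.
    static int leftToRight(int a) {
        int left = a;        // value of a saved first
        int old = a;         // a++ yields the old value ...
        a = a + 1;           // ... and then increments a
        return left * (1 + old);
    }

    // Order matching the observed behaviour: the increment takes effect
    // before the left operand is read.
    static int incrementFirst(int a) {
        int old = a;
        a = a + 1;
        int left = a;
        return left * (1 + old);
    }

    public static void main(String[] args) {
        System.out.println(leftToRight(2));    // 2 * (1 + 2) = 6
        System.out.println(incrementFirst(2)); // 3 * (1 + 2) = 9
    }
}
```

An oracle that follows the first order and a compiler that follows the second disagree on this input, which is exactly the kind of mismatch reported above.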

An inconsistency regarding an error message [13] was revealed when we tested
computations with different data types. As illustrated below, we discovered that
it is possible to add int variables with different bit sizes, but an error is produced
if an int_const is added to an int variable with a smaller bit size.

int8 a = 10; int16 b = 234;
int c = b + a; // works
int e = 234 + a; // TypeError: Oper. + incompatible with types int_const & int8


In this case, our oracle performed the computation without an error, but the 
Solidity compiler produced a type error. For KSolidity, we found an incorrect 
overflow behaviour for computations, and that there is no support for numerous 
language features, like increment operators. 

Additionally, we applied our Solidity Truffle tests to the Conflux blockchain
[17], which is a new alternative to Ethereum. It can basically be seen as another
runtime environment for Solidity contracts. With our tests, we were able to
reveal a bug in the testing environment that produced incorrect results when
we injected formulas with unary and bitwise operators [4].

Java Findings. Our experiments showed that there is an inconsistency [2]
when casts from double and long variables to Integers are performed. These casts
are handled differently by Java when an overflow occurs, i.e., in the following code
the result will be the maximum Integer for the double cast, and bits will be cut
off for the long cast. In K-Java both casts produce the same result, i.e., bits will
be cut off. Although this behaviour is documented in the language specification
and others have already wondered about this issue, we believe that the approach
of K-Java is more consistent, and we are still waiting for a comment from the Java
team on the motivation for handling these cases differently.

System.out.println(((int) 2147483648L)); // -2147483648
System.out.println(((int) 2147483648.0)); // 2147483647

A problem we found for the Java compiler [6] is a missing error message 
when a computation with a long and a double variable is performed. Normally, 
an incompatible types error is produced as illustrated in the following code, but 
the error does not occur when the same computation is done with an += operator. 


long a = 1L + 0.1 * 3L; // produces error: incompatible types: possible lossy
long b = 1L; // conversion from double to long
b += 0.1 * 3L; // no error is produced


We discovered that K-Java has an issue with the modulo operator [14]. The 
computation is wrong for all negative doubles and floats, i.e., it produces incon- 
sistent values compared to Java and compared to the same computation with 
Integer values. This is illustrated in the following examples. 


System.out.println("-8 % 3 = "+(-8 % 3)); //K-Java and Java return -2
System.out.println("-8.0 % 3.0= "+(-8.0 % 3)); //K-Java 1.0 Java -2.0
System.out.println(" 8 % -3 = "+( 8 % -3)); //K-Java and Java return 2
System.out.println(" 8.0 % -3.0= "+( 8.0 % -3.0));//K-Java -4.0 Java 2.0


In general, we found most issues when we injected mathematical expressions
into the seed programs. This was an interesting finding for us, since these
expressions are a major component of all programming languages, and we as- 
sumed it would be straightforward to develop a specification for them. However, 
it turned out that many interesting and ambiguous situations can occur when 
various combinations of operators, variables and literals are tested. 

RQ3: Can SpecTest effectively improve semantic coverage? The objective of 
SpecTest is to systematically generate a test suite for achieving better semantic 
coverage. In order to evaluate the coverage, we conducted the following experi- 
ments. We identified the semantic rules that were least covered by the existing 
tests for Solidity and Java, and then applied SpecTest systematically (with spe- 
cific mutators) and measured the improvement in terms of semantic coverage. 

First, we evaluated the semantic coverage criterion that is achievable with 
the original seed programs of K-Java and KSolidity to have a reference value for 
the comparison with the mutated test programs. Table 2 shows a comparison
of the coverage from the original test cases from K-Java to our mutated test 
cases. The rule coverage of this early version of the K framework of K-Java is 
rudimentary. Hence, we could only measure the covered lines and characters of 
the rule files, and many of these files were already fully covered due to redundant 
or unreachable rules. Nevertheless, we were able to identify various uncovered 
rules in four of the files, and we produced mutations that covered these rules. 

Table 2: Comparison of the covered rules between the K-Java tests (Default)
and our mutated test cases

File                          Default        Mutants        Difference
                              Char   Line    Char   Line    Char   Line
folding.k                     93.04  93.89   93.04  93.89   -      -
unfolding.k                   91.84  94.55   91.84  94.55   -      -
process-class-decs.k          89.07  92.95   89.07  92.95   -      -
expressions.k                 72.30  78.74   86.58  89.92   14.28  11.18
process-comp-units.k          83.39  86.03   83.39  86.03   -      -
static-init.k                 81.20  82.35   81.20  82.35   -      -
process-class-members.k       80.65  83.53   80.65  83.53   -      -
statements.k                  80.51  82.38   80.51  82.38   -      -
new-instance.k                79.59  82.41   79.59  82.41   -      -
method-invoke.k               79.44  80.74   79.44  80.74   -      -
api-core.k                    61.77  63.37   78.74  81.82   16.97  18.45
var-lookup.k                  77.52  79.41   77.52  79.41   -      -
process-type-names.k          76.56  75.76   76.56  75.76   -      -
expressions-classes.k         73.03  65.00   73.03  65.00   -      -
process-local-classes.k       67.62  72.12   67.62  72.12   -      -
process-anonymous-classes.k   66.79  81.52   66.79  81.52   -      -
arrays.k                      62.07  66.90   62.07  66.90   -      -
api-threads.k                 35.51  39.04   41.43  47.01   5.92   7.97
syntax-conversions.k          40.65  42.42   40.65  42.42   -      -
literals.k                    29.19  34.31   38.73  42.72   9.54   8.40

KSolidity was built with a new version of the K framework, which has a
better measurement of the rule coverage. Since the development of KSolidity
is still ongoing, we focused on the completed features, like conditions, loops,
arrays, structs, simple transactions, or mathematical expressions, and
managed to increase the coverage. Even with just these features, we found
meaningful bugs. The coverage improvements compared to the original
seed programs are illustrated in Table 3. There were partially implemented
features which could not be fully covered. The coverage of the completed
features was considerably improved.

We have shown that our mutations can increase the rule coverage both for K-Java
and KSolidity. Our close investigation shows that the increase in coverage
requires non-trivial programs (e.g., programs that specifically include missing
language features) which are unlikely to be generated without our mutator. It
is worth mentioning that writing mutations for the uncovered rules led to the
discovery of many issues. Moreover, the mutations that targeted specific semantic
rules or language features could generally increase the coverage immediately
with a single test, but we still applied them to all seed programs, and we also used
general mutation operators to produce mutants for many different situations.
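
The Difference columns of Table 2 are the mutant coverage minus the default coverage. As a small worked check, the following Java sketch recomputes the character-coverage gain for the four K-Java rule files that changed, using the values from Table 2; the class and method names are illustrative.

```java
// Recomputes the character-coverage gain for the rows of Table 2 that changed.
final class CoverageDelta {

    // Gain in coverage, rounded to two decimals to hide floating-point noise.
    static double gain(double defaultCoverage, double mutantCoverage) {
        return Math.round((mutantCoverage - defaultCoverage) * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        String[] files = {"expressions.k", "api-core.k", "api-threads.k", "literals.k"};
        double[] defaultChar = {72.30, 61.77, 35.51, 29.19};
        double[] mutantChar = {86.58, 78.74, 41.43, 38.73};
        for (int i = 0; i < files.length; i++) {
            System.out.printf("%-15s +%.2f%n", files[i], gain(defaultChar[i], mutantChar[i]));
        }
    }
}
```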


RQ4: How much effort is it to apply SpecTest? To answer this question, we 
analysed the effort required to apply and implement SpecTest for Java and Solid- 
ity. It consists of two parts, the effort of applying SpecTest once it is developed, 
and the implementation effort. The latter one consists of three parts, the effort 
for developing the oracle, the mutator and the fuzzer. The goal of this analysis 
is to understand how generalisable SpecTest is to a new programming language. 


Applying SpecTest after the implementation has the following timing re- 
quirements. Both for Solidity and Java, the mutant generation took only a few 
seconds. For Solidity, we set a timeout of 2 min per contract for fuzzing and 
it took on average 24 min to finish all 37 contracts. Usually, 40-45 test cases 
were created by the fuzzer (normally multiple per contract depending on the 
mutation). Most test cases were executed by KSolidity within a minute, but 
there were outliers which did not terminate even after hours. Hence, we used a 
timeout of 5 min. On average, the testing time of KSolidity was 37 min (when 
five runs with different mutations were considered). For Java, we did not apply 
a fuzzer due to the simplicity of the seed programs. We executed the 756 test 
programs directly with K-Java, which took on average 3 hours and 51 min for 
an introduced mutation (for five runs with different mutation types). 

Table 3: Comparison of the covered rules between the original KSolidity tests
(Default) and our mutated test cases

File               Default  Mutants  Difference
function.k         86.21    86.21    -
[row illegible in source]
solidity-syntax.k  62.50    72.12    9.62
statement.k        92.86    92.86    -
contract.k         94.12    94.12    -
driver.k           19.15    49.15    -
solidity.k         50.00    50.00    -

We now discuss our development efforts and the time requirements of
the implementation of SpecTest for a new language. In our case, the most
effort went into the development of the mutator and the supporting tooling,
like translators. The implementations for both Solidity and Java took
about two to three months each. It
should be noted that this time depends on the availability of existing tools,
like a language parser or fuzzer. For this work, we relied on pre-existing lan- 
guage specifications, which helped to reduce the overall effort, but as mentioned 
they came with limitations, which caused additional effort. Writing a specification
for a new programming language is not trivial. Based on past experience,
we assume that it takes about six to twelve months, depending on the complexity of
the language. Given the many recent efforts on developing executable language
semantics, we believe that SpecTest provides a good way to better utilise these 
existing specifications for systematic compiler testing. 


To summarise, the implementation effort of SpecTest is about two to three
work months, mainly for the mutator, if there is an existing specification and a
fuzzer. In terms of run time, applying our method takes a few hours
for a single mutation. Increasing the number of seed programs and
performing a reasonable number of mutations increases this time to a couple of
days or weeks when the tests are executed on only one machine. Even though
this seems like a lot of effort, we believe our method is still worthwhile, since it
will pay off eventually, especially considering all the effort that can be required
for releasing a new compiler version when serious bugs are discovered. Moreover,
our method can be easily accelerated by distributing it to multiple machines.


As mentioned before, the implementation effort for our method was about
two to three work months. This is about the time that is needed for the mutator
and for other minor tools. It does not include the effort for creating the
language specification or the fuzzer. There are already many existing fuzzers that
could be adopted for new programming languages, and also numerous language
specifications. We especially recommend our method for all languages
with pre-existing specifications (or when similar specifications exist), since then
there is only a small implementation effort, which will soon be offset by the
advantages of SpecTest. Even when there are no pre-existing specifications for
a language, we highly recommend creating one and adopting our method, since
it will save time in the long term.


An effort that should not be underestimated is the time for analysing bugs. It
can be troublesome to find the cause of a bug due to the complexity of the
test cases, i.e., it sometimes took us hours or even days. In such cases, it can be
helpful to minimise failing test cases. There are numerous techniques, like delta
debugging or program slicing [58], which can reduce the debugging effort,
and integrating them into SpecTest would be interesting for future work.
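
To illustrate the idea of minimising a failing test case, the following Java sketch removes one line at a time for as long as the failure persists. It is a simplified greedy reduction in the spirit of delta debugging, not the full algorithm and not part of SpecTest; the `stillFails` predicate is a hypothetical stand-in for re-running the compiler and the specification on the reduced program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified, greedy test-case reduction in the spirit of delta debugging.
final class TestCaseReducer {

    // Drops single lines while the (assumed) failure check keeps reporting the failure.
    static List<String> reduce(List<String> lines, Predicate<List<String>> stillFails) {
        List<String> current = new ArrayList<>(lines);
        boolean shrunk = true;
        while (shrunk) {                     // repeat until no single line can be dropped
            shrunk = false;
            for (int i = 0; i < current.size(); i++) {
                List<String> candidate = new ArrayList<>(current);
                candidate.remove(i);         // remove the line at index i
                if (stillFails.test(candidate)) {
                    current = candidate;     // keep the smaller failing program
                    shrunk = true;
                    break;                   // restart the scan on the reduced program
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> program = List.of(
            "int a = 2;", "int b = 5;", "b += 1;", "a *= 1 + a++;", "int c = a + b;");
        // Hypothetical failure check: the failure is tied to a single statement.
        Predicate<List<String>> stillFails = p -> p.contains("a *= 1 + a++;");
        System.out.println(reduce(program, stillFails)); // [a *= 1 + a++;]
    }
}
```

A realistic failure check would also have to reject reduced programs that no longer compile, which the toy predicate above ignores.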


3.3 Threats to Validity


A threat to the validity of our evaluation might be that we did not show a 
comparison to other compiler testing methods. A comparison might be inter- 
esting, but our main goal was to show the general applicability and usefulness 
of SpecTest for different compilers. It would not be fair to compare SpecTest 
to other testing techniques that focus on different types of bugs, e.g., it might 
be much easier to find simple parsing errors caused by unusual characters (with 
techniques, like fuzzing). 

One might argue that the test size we used is too limited, which might be a 
potential threat to the validity of our evaluation. It is true that it would make 
sense to apply more seed programs and to continue mutating and testing for an 
extended period of time. However, due to restrictions of KSolidity and K-Java, 
a larger set of seed programs was not supported, and due to a limited time and 
computing budget, we did not execute more tests. Nevertheless, we believe that 
our test size was reasonable, since it allowed us to reveal various issues and bugs. 

Another threat to the validity of our evaluation might be that we relied
solely on existing specifications, whose quality we cannot be sure about.
It is true that we might have more confidence in a specification
that we created ourselves, but since SpecTest checks the correctness of compilers
as well as specifications, we trust that the specifications had a reasonable quality.


4 Related Work 


Compiler testing is a broad research field with a range of techniques that target,
e.g., the test case generation [49,31,23] or the oracle problem [22]. Several surveys
give an overview of these methods [562689125]. Our study, however, shows that
existing approaches suffer from two weaknesses. They do not apply a test case 
generation that can extensively cover rare language features, and they often 
rely on weak or limited test oracles. The test case generation often works with 
standard code coverage criteria concerning compiler components. For example, 
Zelenov and Zelenova [61] applied a BNF grammar as a model and produced
test cases according to, e.g., code or functional coverage of a syntax analyser. A 
method based on the coverage of context-free grammar rules was presented by 
Purdom [49], but it only targets the parser of the compiler. Kalinov et al. 
defined coverage criteria based on a statement machine specification. In contrast 
to our work, they do not identify rare language features by analysing semantic 
rule coverage, and they do not construct their test programs via code mutation. 

Various compiler testing methods work without any coverage criterion by just
randomly generating test cases according to a grammar, which defines valid
programs [52,60]. There are also techniques that use mutation for producing test
cases [38,41,55]. For example, Le, Sun, and Su [41] produced mutants that should
have the same behaviour as the original programs in order to find cases where the 
behaviour diverges. However, in contrast to our work, they are not considering 
a semantic coverage for less used language features. 

Several attempts have been presented to answer the oracle problem for compiler
testing. In the simple case of positive/negative testing, an oracle only tells
whether a program is compilable. When a test program is compiled, the result is 
checked to see if it matches the expectation of the oracle. A match means a suc- 
cessful compilation. Otherwise, there may be a bug. For example, Zelenov and 
Zelenova illustrated a specification-based approach for generating positive 
and negative tests. Such approaches are limited to testing the syntax parser. 


In the line of work on differential testing of compilers [45], the oracle is defined
as consistency among two or more compilers for the same language. In this 
method, the same test programs are given to multiple compilers and the results 
are compared. If there is a difference then a bug in one of the compilers or an 
ambiguity in the language is found. There exist different versions of differential 
testing as explained by McKeeman [45]. Cross-compiler testing is a technique 
that works by contrasting a new compiler against a pre-existing compiler that 
has the same specification. When the same test programs are executed with 
both compilers, a different result can reveal a fault in the new or pre-existing 
compiler. Sometimes this technique is also called randomised differential testing 
[60], because the test programs are usually generated randomly, e.g., based on 
a grammar. Another differential testing technique is cross-optimisation testing, 
where programs compiled with different optimisations implemented for the same 
compiler are contrasted to find bugs. Le, Sun, and Su [42] presented such a 
technique for stress testing link optimisers. Their method generates random test 
programs and injects various function calls into different code regions in order to 
increase dependencies between procedures, and it also randomly selects different 
optimisation levels to produce challenging tests for the optimiser. Cross-version 
or regression testing is another differential testing method that tries to find bugs 
by comparing different versions of the same compiler. For example, Sun, Le, 
and Su [54] developed Epiphron, a tool that generates random test programs to 
find inconsistencies with the debug information, like missing warning messages, 
in different versions of the same compiler. Such approaches work only if there 
are multiple relatively mature compilers for the same language. In contrast to
these techniques, SpecTest works with a formal language specification, which
is especially useful when no compiler can be used as a reference. Moreover,
different compilers or compiler versions for the same language might still suffer
from the same bugs, which is unlikely for an independent specification.
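
The core of differential testing can be sketched in a few lines of Java. The functions below are hypothetical stand-ins for compilers (they merely model two evaluation orders of `a + a++`), and all names are illustrative. The second call shows the limitation noted above: two implementations that share a fault never disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy differential-testing harness; the "implementations" stand in for compilers.
final class DifferentialTester {

    // Returns every input on which the two implementations disagree.
    static <I, O> List<I> disagreements(List<I> inputs,
                                        Function<I, O> first,
                                        Function<I, O> second) {
        List<I> failing = new ArrayList<>();
        for (I input : inputs) {
            if (!first.apply(input).equals(second.apply(input))) {
                failing.add(input);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        List<Integer> inputs = List.of(0, 1, 2);
        // Stand-in for a correct reference: Java's left-to-right evaluation of a + a++.
        Function<Integer, Integer> reference = a -> { int x = a; return x + x++; };
        // Stand-ins for two compiler versions that both apply the increment first.
        Function<Integer, Integer> versionA = a -> (a + 1) + a;
        Function<Integer, Integer> versionB = a -> (a + 1) + a;

        System.out.println(disagreements(inputs, reference, versionA)); // [0, 1, 2]
        System.out.println(disagreements(inputs, versionA, versionB));  // []
    }
}
```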


There are approaches that assume the existence of a reference compiler, i.e., 
the oracle is an existing formally proven compiler. For example, Leroy pre- 
sented CompCert, a compiler for a subset of C, which was verified with the 
proof assistant Coq. However, there are usually no such compilers for a newly 
developed language and the existing ones cover only subsets of languages since 
formally proving a compiler is extremely challenging. 


For metamorphic testing [57], the oracle is defined as certain algebraic prop- 
erties of the compiler. For instance, one such property explored in the compiler 
testing technique called equivalence modulo inputs (EMI) is that a mod- 
ification on a program part which is never executed should not alter the result. 
Based on this simple oracle, EMI works by randomly pruning dead code (i.e.,
code which is not executed given a certain program input) or by randomly insert- 
ing or removing instructions from dead code based on a Markov Chain Monte 
Carlo method. Such approaches are limited to identifying bugs which violate the 
algebraic properties. Hence, they are not able to find deep semantic errors. 

The closest related work to SpecTest was proposed by Kalinov et al. [35,36],
where a language specification in the form of abstract state machines and mon- 
tages is used as an oracle. With this specification, they compare the expected 
output from the specification to that of a compiled program in order to check 
whether there are compiler bugs. This approach is limited by the choice of the 
specification language and it quickly becomes infeasible, because the computa- 
tion time is too high. Moreover, it is not concerned with semantic coverage. 

To demonstrate the limitations of the closely related methods, we come back
to the example of Sect. 2, i.e., we discussed a bug with the increment operator
that we discovered during our analysis of the Solidity compiler.


int a = 1; int result = a + a++; //produces 3, but it should be 2


In this example, the compiler had an issue with the computation order, which
led to wrong results. Existing approaches, like EMI or differential testing,
might be able to detect such issues, but with EMI it is difficult to find mutations
that lead to such cases. The same is true for differential testing, and there is also
a high chance that different compiler versions have the same faulty behaviour
for such a case (e.g., all versions of the Solidity compiler had this issue).


5 Conclusion 


We have demonstrated our novel compiler testing technique called SpecTest 
that targets less-used language features. SpecTest is based on three components: 
an executable language specification, a fuzzer for generating test inputs, and 
a mutator which generates new programs by injecting rare language features. 
Comparing the abstract execution of the specification to the concrete execution 
of a compiled program enables our method to find deep semantic errors as well 
as inconsistencies and issues in the specification. 

We evaluated SpecTest by applying it to two programming languages: Java 
and Solidity. The results are encouraging. We discovered various issues con- 
cerning the compilers and the language specifications. Some of them helped to 
improve the quality of the compilers and many will enhance the specifications. 

In the future, we plan to further explore the generality of SpecTest for other 
languages, and we intend to consider different types of executable specifications. 


Abstract. Proactive adaptation, in which the adaptation for a system’s
reliable goal achievement is performed by predicting changes in the envi- 
ronment, is considered as an effective alternative to reactive adaptation, 
in which adaptation is performed after observing changes. When predict- 
ing the environmental changes, the prediction may be uncertain, so it is 
necessary to verify and confirm an adaptation’s consequences before ex- 
ecution. To resolve the uncertainty, probabilistic model checking (PMC) 
has been utilized for verification of adaptation tactics’ effects on the goal 
of a self-adaptive system (SAS). However, PMC-based approaches are
limited by the state-explosion problem in the verification of complex SAS
models and by the modeling languages supported by the model checkers. In
this paper, to overcome the limitations of the PMC-based approaches, 
we propose an efficient Proactive Adaptation approach based on STA- 
tistical model checking (PASTA). Our approach allows SASs to mitigate 
the uncertainty of the future environment, faster than the PMC-based 
approach, by producing statistically sufficient samples for verification 
of adaptation tactics based on statistical model checking (SMC) algo- 
rithms. We provide algorithmic processes, a reference architecture, and 
an open-source implementation skeleton of PASTA for engineers to apply 
it for SAS development. We evaluate PASTA on two SASs using actual
data and show that PASTA is efficient compared to the PMC-based
approach. We also provide a comparative analysis of the advantages and
disadvantages of PMC- and SMC-based proactive adaptation to guide 
engineers’ decision-making for SAS development. 


Keywords: Self-adaptive system · Proactive adaptation · Statistical
model checking · Environmental uncertainty


1 Introduction 


As the complexity of an environment that affects a system’s goal achievement in- 
creases, analyzing the environment becomes important for reliable goal achieve- 
ment. The environment, such as user traffic and outdoor temperatures, can 
change over time [15,29]. Full anticipation of environmental changes at the sys- 
tem design time is challenging and often impossible [6,9]. Systems are required 
to be self-adaptive so that they change their behaviors and structures according
to the environmental changes at runtime. To realize this, numerous design
approaches [11,13,14,16] have been proposed based on the MAPE feedback loop 
[18]. These adaptation processes involve the continual monitoring and analysis 
of the environment as well as the planning and execution of the adaptation. 

For most existing approaches, adaptation has been reactively triggered by 
system failures or changes in the environment [12,31,33]. Other adaptation ap- 
proaches, known as proactive or predictive adaptation, have emerged, which 
have proven to be more effective than reactive adaptations in a changing envi- 
ronment by predicting changes in advance [2,24,26]; however, the prediction of 
environmental changes is uncertain, so the uncertainty affects the consequences 
of proactive adaptation. To resolve the uncertainty, probabilistic model checking 
(PMC) was utilized in some studies for the verification of adaptation tactics and 
their effects on the system’s adaptation goal [5,26,27,28]. 

PMC-based approaches are a major method used for proactive adaptation;
however, PMC may not be appropriate for the verification of large and complex
self-adaptive system (SAS) models due to the state explosion problem.
PMC requires a high verification cost in time and memory to fully examine 
the given probabilistic models, so the verification of complex SAS models and 
adaptation tactics may fail due to time and memory constraints. In addition, 
modeling languages supported by probabilistic model checkers must be used for 
the modeling of the SAS and the environment. Engineers must be familiar with 
modeling languages, such as Markov chains, Markov decision processes, or
automata, that model checkers can interpret [21]. To overcome these limitations, we
propose an efficient proactive adaptation approach based on statistical model
checking (SMC) that consumes fewer verification resources than PMC and
only requires simulation results of system models, without restricting the
modeling language.

Our Proactive Adaptation approach based on STAtistical model checking 
(PASTA) offers the following contributions: 


— We propose a proactive adaptation approach utilizing SMC to eliminate the 
uncertainty of the future environment faster than PMC for the verification 
of adaptation tactics. 

— We provide algorithmic processes, a reference architecture, and an open- 
source implementation skeleton of PASTA for developers who will apply 
PASTA to SAS development. 

— Based on evaluations using actual data, we also provide a comparative anal- 
ysis of the advantages and disadvantages of PMC- and SMC-based proactive 
adaptation to guide engineers’ decision making. 


The remainder of this paper is organized as follows. Section 2 introduces
related work on proactive adaptation. Section 3 provides background knowledge
on SMC. Section 4 presents an illustrative example. Section 5 introduces our
PASTA approach. Section 6 evaluates PASTA based on two SASs with actual
data. Section 7 discusses the threats to the validity of our work. Section 8 concludes
the paper.


2 Related Work: Proactive Adaptation


Numerous studies on proactive or predictive adaptation have been conducted 
to address issues related to changing environments [3,20,24,25]. As opposed to 
reacting to changes in the environment or system, predicting and responding to 
the predicted situations could be more difficult but more effective in prevent- 
ing system failures and meeting requirements. Many case studies on proactive 
adaptation have been conducted, and it has been demonstrated that proactive 
adaptation outperforms reactive adaptation in terms of the system’s adaptation
goal [2,10,20]. For proactive adaptation, the prediction of the future environment
is uncertain, so approaches utilizing probabilistic model checking (PMC), which
verifies the property satisfaction of a probabilistic model, have been proposed to
provide verified and trustworthy proactive adaptation results [5,26,27,28]. The
main process of PMC-based proactive adaptation is illustrated in Fig. 1. The core
of the process is the formal modeling of the future environment, system, and
adaptation tactics, and the verification of the models to identify an optimal
adaptation tactic for adaptation goal achievement. However, PMC is not
appropriate for the verification of large and complex models due to its state explosion
problem. It requires exhaustively examining all possible states of SAS models
to verify adaptation tactics. It also requires engineers to develop SAS models
written in modeling languages that model checkers can support. To tackle these
limitations, we propose in this paper a statistical model checking (SMC)-based
proactive adaptation approach [19,23,34] as an alternative to PMC-based
approaches, which have been the major trend in proactive adaptation.

Fig. 1. PMC-based proactive adaptation process


3 Background: Statistical Model Checking (SMC)


We have utilized statistical model checking (SMC) to verify adaptation tactics
at runtime under an uncertain environment. SMC is an efficient technique for
verifying a stochastic model [22,23]. Whereas PMC exhaustively examines the
model, SMC simulates the model to obtain samples and provides statistical
evidence of the satisfaction or violation of the given property using hypothesis
testing on the samples. Because SMC requires only a set of simulation results,
it can be applied to an executable black-box model or directly to a set of
simulation results. As with PMC, the verification results depend on the quality
of the model. However, as a simulation-based approach, SMC is known
to be an efficient alternative to PMC in terms of time and memory, performing
verification with a certain confidence [1,19]. In this regard, SMC can be used
effectively for the runtime verification of SAS adaptation tactics with uncertain
environments. The following examples of SMC algorithms are widely used:


– Simple Monte Carlo Simulation (SMCS). This is the simplest and
most intuitive SMC algorithm [1,4]. It estimates the quantitative satisfaction
of a property as the ratio of the samples that satisfy the property to
all samples. The user must specify a fixed number of samples.

– Single Sampling Plan (SSP). The SSP [34] tests a hypothesis H : p ≥ θ
with fixed-size samples, where p is the probability that a system meets a
given property and θ is the verification threshold of p. The user provides two
error bounds α (0 < α < 1) and β (0 < β < 1) on false negatives and false
positives, respectively. Within the given error bounds, the SSP estimates p
to accept or to reject H. The detailed algorithm can be found in [19,23,34].

– Sequential Probability Ratio Test (SPRT). Similar to the SSP, the
SPRT [32] tests a hypothesis H within the given error bounds, but the
number of samples is determined automatically. It simulates the target system
to obtain a sample, and iterates the simulations until it has enough
samples to accept or reject H within the given error bounds. The detailed
algorithm can be found in [19,23,34].
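
To make the difference between a fixed and an automatically determined sample size concrete, the following is a minimal Python sketch of SMCS and of Wald's SPRT for a property that each simulation run either satisfies or violates. It is an illustration under stated assumptions, not the implementation used in this paper: the function names, the callable `simulate_once`, the indifference-region half-width `delta`, and the `max_samples` cap are all hypothetical choices introduced here.

```python
import math
import random


def smcs_estimate(simulate_once, n_samples):
    """SMCS: fraction of a fixed number of runs that satisfy the property."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    hits = sum(1 for _ in range(n_samples) if simulate_once())
    return hits / n_samples


def sprt(simulate_once, theta, delta, alpha, beta, max_samples=100_000):
    """Wald's SPRT for H: p >= theta with an indifference region (assumption).

    alpha bounds the chance of rejecting H when it holds (false negative),
    beta bounds the chance of accepting H when it does not (false positive).
    Returns (accept_h, samples_used); accept_h is None if the cap is reached.
    """
    p0, p1 = theta + delta, theta - delta  # H0: p >= p0, H1: p <= p1
    if not 0.0 < p1 < p0 < 1.0:
        raise ValueError("theta +/- delta must lie strictly inside (0, 1)")
    reject_at = math.log((1 - beta) / alpha)   # log-ratio this large: reject H
    accept_at = math.log(beta / (1 - alpha))   # log-ratio this small: accept H
    log_ratio = 0.0
    for n in range(1, max_samples + 1):
        # Update log( P(data | p1) / P(data | p0) ) with one more run.
        if simulate_once():
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio <= accept_at:
            return True, n
        if log_ratio >= reject_at:
            return False, n
    return None, max_samples


# Hypothetical stand-in for one simulation of the SAS model:
# the property holds in roughly 90% of the runs.
rng = random.Random(42)
noisy_run = lambda: rng.random() < 0.9
estimate = smcs_estimate(noisy_run, 1000)
decision, used = sprt(noisy_run, theta=0.8, delta=0.05, alpha=0.05, beta=0.05)
```

SMCS always consumes exactly `n_samples` runs, whereas the SPRT stops as soon as the accumulated evidence crosses one of the two boundaries, so the number of runs it needs depends on how far the true probability lies from the threshold.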


For the PASTA approach, an SMC algorithm is selected and used to obtain 
statistical evidence of an adaptation tactic’s performance in a future environment 
to evaluate possible tactics and to identify the optimal tactic at runtime. 


4 Illustrative Example 


We illustrate PASTA using an adaptive air condition control system as an ex- 
ample. The system monitors indoor and outdoor air conditions, including tem- 
perature and humidity, and adaptively controls the indoor condition for a given 
target condition. Planning an adaptive air condition control with an immediate 
reaction to the monitored indoor condition can aid the system in achieving its 
goal; however, the indoor air conditions may change over time due to the
influence of the outdoor air conditions, as shown in Fig. 2. If the adaptation
plan is made without taking the environmental change into account, the
adaptation consequences may differ from the expectations, and a better
adaptation tactic may be overlooked. The air condition control system
developed by the PASTA approach forecasts future air condition changes and 
selects an optimal adaptation tactic whose adaptation consequences are verified 
by SMC at runtime. Throughout this paper, we will describe our approach using 
this example. 
Fig. 2. Adaptive air condition control system


5 Proactive Adaptation Based on Statistical Model 
Checking 


5.1 PASTA overview 


We propose PASTA, a proactive adaptation approach using SMC.
Fig. 3 presents the overall adaptation process. The aim of the approach is to 
provide efficient proactive adaptation based on the prediction of environmen- 
tal changes and the verification of the adaptation tactics of the SAS. (Step 1) 
Initially, PASTA continuously monitors the environment to capture its change 
at runtime. (Step 2) It analyzes the monitored (historical) environment data 
and forecasts future environmental changes based on its forecasting algorithm. 
The prediction or expectation of the future environment is in the form of non- 
deterministic possibility, such as the probability density function of future envi- 
ronmental conditions. (Step 3) Based on the prediction, a sample of the possible 
future environment is made and given to the simulation engine as a simulation 
environment. (Step 4) In the given environment, an adaptation tactic is applied 
to the system model and simulated to make a sample evaluation of the
tactic’s performance. The simulations are repeated until the system obtains a
statistically sufficient number of samples for verifying the tactic’s
performance against the adaptation goal under the expected future environmental
change. (Step 5) Based on the accumulated samples, the performance of the
adaptation tactic is verified. All adaptation tactics are evaluated in the same
manner, so the SAS obtains statistical evidence of the effects of its adaptation
tactics. (Steps 6 and 7) When all possible adaptation tactics have been evaluated,
an optimal adaptation tactic is chosen and executed. This adaptation process 
is continuously repeated to respond to continuous environmental changes. We 
describe the PASTA approach in detail based on this adaptation process in the 
subsequent sections. 
Fig. 3. Overall PASTA process


5.2 Knowledge 


Principle. The PASTA approach requires an SAS to accumulate the monitored 
environment data. The accumulated historical environment data is analyzed to 
predict environmental changes. Furthermore, the system maintains a current
system model, an abstraction of the system behavior that a simulator can
execute. The model in PASTA is user-specific: the engineer chooses the modeling
language and the system information to be modeled, and the only requirement is
that the model is executable and generates simulation logs. The system
model also contains a finite set (space) of possible adaptation tactics to
be verified. An adaptation tactic is a specification of an adaptation that can
be applied to the SAS and its model, such as a set of configurations. The
adaptation goal is also specified in the knowledge, so that the optimal tactic
for the adaptation goal can be selected and executed.

Example. The environmental factors of interest in the adaptive air condition 
control system are the indoor/outdoor temperature and humidity; therefore, the 
monitored environment data at a specific time include values of four factors. The 
simulation models imitate the changes of the indoor temperature and humidity 
affected by outdoor conditions and the air condition control system’s control 
values. The system’s possible adaptation tactics are defined by its temperature
and humidity control capabilities. For example, the
system can increase or decrease the temperature and humidity in 0.1°C and
0.1% increments up to 5°C and 5%, respectively, in a discrete simulation time
unit. The tactic space is a Cartesian product of the possible temperature and 
humidity controls. The adaptation goal is to manipulate the indoor temperature 
and humidity to the user’s desired conditions. 
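
As an illustration of the tactic space described in this example, the Python sketch below enumerates the control values from -5 to +5 in steps of 0.1 and forms their Cartesian product. The function names and the representation of a tactic as a `(temperature_change, humidity_change)` tuple are assumptions made for this sketch only.

```python
from itertools import product

LIMIT_TENTHS = 50  # controls range from -5.0 to +5.0, stored as integer tenths


def control_values():
    """Control values -5.0, -4.9, ..., +4.9, +5.0 (in °C or in %)."""
    # Iterating over integer tenths avoids accumulating floating-point drift.
    return [t / 10 for t in range(-LIMIT_TENTHS, LIMIT_TENTHS + 1)]


def build_tactic_space():
    """Cartesian product of temperature controls and humidity controls."""
    values = control_values()
    return list(product(values, values))


tactic_space = build_tactic_space()
```

Under these assumptions each dimension has 101 control values, so the full product contains 101 × 101 = 10,201 candidate tactics.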


5.3 Monitoring Environmental Changes 


Principle. (Step 1) The system constantly monitors the environment. The en- 
vironment is measured as the values of the environmental conditions observable 
by the sensors. The current environmental data are added to the environment 
database. The current state of the system is also monitored, and the system 
model is kept up to date. 

Example. The air condition control system constantly monitors the in- 
door/outdoor temperature and humidity. It accumulates the environment data 
in its environment database. 


5.4 Forecasting Future Environmental Change 


Principle. (Step 2) PASTA forecasts future environmental changes based on
the accumulated historical environment data using data analysis or forecasting
techniques. As the historical environmental data are time-series data,
time-series analysis and forecasting methods, such as random walk [30],
error-trend-seasonal [17], autoregressive integrated moving average models [7],
or machine-learning techniques, can be applied; the choice of the forecasting
method is left to domain engineers. What is important here is that
the predictions of future environmental changes based on historical data are
uncertain, so the results of the forecasting are non-deterministic expectations,
such as the probability density function of future environmental conditions. This
uncertainty is addressed by SMC.

Example. The system predicts the outdoor temperature and humidity changes,
which exhibit distinct repetitive patterns (seasonality) at 24-hour intervals.
Because of this seasonality, they can be predicted naively with a random walk
model using seasonal differencing [17]. Based on the historical temperature data
and the forecasting algorithm, the temperature change from the present to a few
hours later can be predicted as a probability density function. For example, if
the current temperature at 2 p.m. is 24°C, the temperature at 3 p.m. can be
expected to follow the uniform distribution between 24°C and 30°C.
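
One simple way to turn a seasonal random walk into a non-deterministic forecast is sketched below in Python. It is only an illustration and not the forecasting engine of this paper: the point forecast is the observation one season (24 hours) earlier, and the spread is obtained by resampling past seasonal differences, which is an assumption of this sketch (the running example uses a uniform distribution instead). The function name and parameters are hypothetical.

```python
import random


def seasonal_naive_samples(history, horizon=1, season=24, n_samples=1000, rng=None):
    """Draw samples of the value `horizon` steps after the last observation.

    Point forecast: the observation one season earlier (seasonal random walk).
    Spread: past seasonal differences y[t] - y[t - season], resampled at random.
    """
    if not 1 <= horizon <= season:
        raise ValueError("horizon must be between 1 and season")
    if len(history) < 2 * season:
        raise ValueError("need at least two full seasons of history")
    rng = rng or random.Random()
    # Empirical distribution of how much a value differs from one season ago.
    diffs = [history[t] - history[t - season] for t in range(season, len(history))]
    # Observation at the same position in the previous season as the target time.
    base = history[len(history) - season + horizon - 1]
    return [base + rng.choice(diffs) for _ in range(n_samples)]
```

Each returned value is one possible future condition, so the list can serve directly as the pool from which environment samples are drawn in the planning step.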


5.5 Planning Adaptation Using SMC 


Algorithm 1: PASTA adaptation planning
Input : envPrediction, sysModel, tacticSpace, goalProp
Output: optimalTactic
Procedure
    evaluationSheet = [];
    foreach tactic in tacticSpace do
        simulationResultList = [];
        while !samplesSufficient() do
            envSample = makeSample(envPrediction);
            simResult = simulate(envSample, sysModel, tactic);
            addElement(simulationResultList, simResult);
        end
        evaluationResult = verify(simulationResultList, goalProp);
        addElement(evaluationSheet, (tactic, evaluationResult));
    end
    optimalTactic = getOptimalTactic(evaluationSheet);
end


Principle. The adaptation planning of the PASTA approach involves search- 
ing for the optimal tactic among possible adaptation tactics using SMC, as shown 
in Algorithm 1. Evaluating an adaptation tactic using SMC consists of three 
steps: sampling environmental changes, simulating adaptation tactics, and
verifying the simulation results. (Step 3) The forecasting result is
non-deterministic, so the sample generator produces a deterministic sample of
possible future environmental conditions based on the forecasting result. SMC
handles the uncertainty of the non-deterministic future environment by producing
a statistically sufficient number of samples, whereas PMC probabilistically
verifies a stochastic model. The number of samples depends on the SMC algorithm,
as explained in the background section. (Step 4) The simulator takes the sample
environment, the system model, and an adaptation tactic as inputs. It applies
the given tactic to the system model, simulates the system in the sample of the
future environment, and returns simulation result logs that represent the
effects of the adaptation tactic in the future environment. (Step 5) The
verifier receives the
numerous simulation results and evaluates the tactic’s performance for the adap- 
tation goal represented as a verification property. This process is performed for 
all adaptation tactics, and (Step 6) the optimal tactic is selected based on all 
evaluation (verification) results. Therefore, the planning time required for an 
adaptation depends on the number of tactics, the number of required samples, 
and the time for a single simulation of the model. 
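
The dependence stated in the last sentence can be written as a rough cost model. The notation below is introduced here for illustration only and is not part of the original formulation: $|\mathcal{T}|$ is the number of tactics, $N$ the number of samples per tactic (fixed for SMCS), and $t_{\mathrm{sim}}$ the time of a single simulation. The numeric line assumes that the tactic space of the running example is the full Cartesian product of 101 temperature and 101 humidity control values.

```latex
\begin{align*}
  T_{\mathrm{plan}} &\approx |\mathcal{T}| \cdot N \cdot t_{\mathrm{sim}} \\
  |\mathcal{T}| = 101 \times 101 = 10\,201,\quad N = 1000
  &\;\Longrightarrow\; |\mathcal{T}| \cdot N = 10\,201\,000
  \text{ simulation runs per adaptation cycle}
\end{align*}
```

The model ignores sampling and verification overhead, so it should be read as an order-of-magnitude estimate rather than a measured cost.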
Example. Based on the predicted range of the temperature change at 3 p.m.
(24°C to 30°C), samples of the future outdoor temperature (for example,
25°C, 27°C, and 29°C) are randomly selected by an SMC algorithm. The system
model and the adaptation tactic under evaluation (for example, lower the indoor
temperature by 3°C) are simulated in each of the sample environments. Based on
the simulation results, the verifier evaluates the adaptation results of the
indoor temperature control. In this example, the average distance between the
target condition and the controlled indoor condition is used as the
verification property representing the adaptation goal, but the maximum
distance indicating the worst case, the presence or absence of events occurring
with small probabilities, or any temporal logic property can also be used as a
verification property [19,23,34]. When all possible temperature and humidity
control tactics are verified (evaluated), the optimal one is selected.
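
The planning loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the released skeleton: the stopping rule is a fixed sample budget (as in SMCS), the verification result is a score where smaller is better, and the room model, its `coupling` parameter, and all function names are hypothetical stand-ins chosen to mirror the temperature example.

```python
import random
from statistics import mean


def plan_adaptation(env_prediction, sys_model, tactic_space, goal_prop, n_samples, rng):
    """Algorithm 1 with a fixed number of samples per tactic (assumption)."""
    evaluation_sheet = []
    for tactic in tactic_space:
        results = []
        while len(results) < n_samples:                    # samplesSufficient()
            env_sample = env_prediction(rng)               # makeSample
            results.append(sys_model(env_sample, tactic))  # simulate
        evaluation_sheet.append((tactic, goal_prop(results)))  # verify
    # getOptimalTactic: the tactic with the smallest score wins.
    return min(evaluation_sheet, key=lambda entry: entry[1])[0]


def predict_outdoor(rng):
    """Forecast for 3 p.m. in the running example: uniform on [24, 30] °C."""
    return rng.uniform(24.0, 30.0)


def toy_room_model(outdoor_temp, tactic, indoor_temp=24.0, coupling=0.5):
    """Hypothetical model: indoor drifts toward outdoor, then control is applied."""
    return indoor_temp + coupling * (outdoor_temp - indoor_temp) + tactic


def mean_distance_to(target):
    """Verification property: average distance to the target condition."""
    return lambda results: mean(abs(r - target) for r in results)


tactics = [t / 10 for t in range(-50, 51)]  # -5.0 ... +5.0 °C in 0.1 steps
best = plan_adaptation(predict_outdoor, toy_room_model, tactics,
                       mean_distance_to(25.0), 200, random.Random(0))
```

In this toy setting the uncontrolled indoor temperature would settle between 24°C and 27°C, so the selected tactic is a small cooling action. Swapping the `while` condition for a sequential test would give an SPRT-style variant without changing the rest of the loop.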


5.6 Executing Adaptation 


Principle. (Step 7) The chosen optimal adaptation tactic is applied to the 
managed system by the actuators of the system. 

Example. The adaptive air condition control system applies the selected optimal
temperature and humidity controls, which affect the indoor conditions
through the system’s actuators.


5.7 PASTA Implementation 


We also provide a PASTA reference architecture in Fig. 4 for the implementation 
of this approach. It is a layered architecture of an SAS with the PASTA approach. 
In the interaction layer, PASTA monitors the environment and managed system
through the sensors and affects them through the actuators, like typical SASs. In
the data analysis layer, there is a forecasting engine for the prediction of
environmental changes and a knowledge management module for keeping the knowledge
of the system up to date at all times. In the adaptation planner layer, a module
searches for the optimal adaptation tactic through interactions with the
adaptation verification layer. In the adaptation verification layer, the SMC module
verifies an adaptation tactic by coordinating the sample generator, the simulator,
and the verifier.

The sample generator produces samples of the future environment based on 
the prediction of the forecasting engine. The simulator simulates the system 
model with an adaptation tactic in the given sample future environment. The 
verifier analyzes the simulation results to check the adaptation goal achieve- 
ment, such as quality of service or invariant properties. In the knowledge layer, 
there is an environment database, a system model manager, an adaptation tac- 
tic repository, and an adaptation goal manager. This layer interacts with the 
others, providing and updating the knowledge of the SAS. This architecture is 
a reference, so it includes the essential components of an SAS with the PASTA 
approach and can be extended. 
Fig. 4. PASTA reference architecture


In addition, to support engineers who develop SASs based on the PASTA
approach, we implemented a PASTA skeleton based on the reference architecture,
with guiding comments, and released the source code in an open-source
repository¹. The skeleton is available in Java and Python. Engineers should
write application-specific code following the comments tagged with “todo”. The
class diagram of the skeleton is presented in Fig. 5. An adaptation is activated
by the “adaptManagedSystem” operation. The skeleton eases PASTA implementation
and allows third-party libraries or tools to be used for some components, such
as the forecasting engine or the SMC module.


6 Evaluation 


6.1 Research Questions 


We demonstrate the feasibility of applying the PASTA approach to SAS
development as an efficient alternative to PMC-based proactive adaptation. We
address three research questions.

RQ1: (Cost efficiency of PASTA) How fast is PASTA’s adaptation
planning? PASTA leverages SMC for efficient adaptation verification at runtime.
Almost all existing proactive adaptation approaches utilize PMC
for the runtime verification of adaptation tactics, and PASTA is intended as an
efficient alternative to them.
To determine the efficiency of PASTA, we compare the adaptation planning time
of PASTA and the PMC-based adaptation. We measure the differences in time
consumption between SMC- and PMC-based approaches in solving proactive
adaptation problems of the same complexity.

RQ2: (Adaptation planning accuracy of PASTA) How accurately 
does PASTA search for the optimal adaptation tactic? PMC formally 
examines a probabilistic model and verifies whether it satisfies the given proper- 
ties; however, SMC examines the given model with numerous sample simulation 


1 https://github.com/yongjunshin/PASTA
Fig. 5. Class diagram of the PASTA skeleton


results, so it returns statistical evidence of the model’s properties and thus
has the inevitable limitation that its verification results can be inaccurate
because the number of samples is finite. It is known that SMC can produce
results similar to PMC [19,23,34], and for this research question, we compare
the proactive adaptation planning results of PASTA with the planning
results of the PMC-based approach. We determine how much accuracy is
lost in exchange for the cost savings identified in RQ1 and whether the loss of
accuracy is acceptable.

RQ3: (Adaptation performance of PASTA) How effective is the 
adaptation goal achievement performance of PASTA? For research ques- 
tion 3, we examine whether the PASTA approach is actually effective in achiev- 
ing the adaptation goals of SASs. To evaluate the adaptation performance of 
PASTA, we compare the simulation execution results of approaches taking no 
adaptation, reactive adaptation, PMC-based proactive adaptation, and PASTA. 


6.2 Evaluation Setup 


We evaluate the PASTA approach using two example SASs. One is the adap- 
tive air condition control system, the illustrative example of this paper, and the 
Fig. 6. Adaptation tactic of traffic signal controller


other is an adaptive traffic signal controller of an intersection. The flow of cars in 
cities changes with the passage of time, which causes traffic congestion. A smart 
traffic signal controller that automatically controls traffic flow is a good exam- 
ple of applying proactive adaptation because changes in traffic conditions can 
be predicted based on historical data. Our signal controller predicts the traffic 
volume in an intersection and identifies an optimal configuration of signal pat- 
terns that minimizes the number of waiting vehicles. An actual signal controller 
is abstracted, and durations of signal patterns are dynamically controlled, as 
shown in Fig. 6. We applied PASTA to the two cases of different complexities 
and simulated them based on actual data acquired from public data repositories 
to make them realistic. Detailed descriptions of the two SASs and the evaluation 
setup are provided in Table 1. 

We compared the adaptation cost, accuracy, and performance of the PASTA
approach with the PMC-based proactive adaptation approach. The PMC-based
proactive adaptation approach was implemented following a pioneering paper
[26]. PRISM, a widely used probabilistic model checker, was utilized in the
implementation [21] with its default hybrid computation engine. The models of
environments, systems, and tactics were specified as Markov decision processes
(MDPs), and the adaptation goals were specified as reward-based properties
of the MDPs. As in [26], the upcoming environmental changes were
predicted based on the data, and the PRISM modules were constructed and
verified based on the prediction to find the optimal adaptation tactic.
In addition to the PMC-based approach, non-adaptation and reactive
adaptation approaches were also compared in terms of a system’s goal achievement.
For the PASTA approach, SMCS, the simplest SMC algorithm as explained in
the background section, was implemented and evaluated by varying the number
of samples used for the verification from 10 to 10000 (10, 100, 1000, 2000, ...,
9000, 10000).


6.3 Evaluation Results 


RQ1: We measured and compared the time spent on adaptation planning for 
both case systems using the PASTA and PMC-based approaches. The adap- 
tation planning time includes modeling or sampling time and probabilistic or 
statistical verification time to identify the optimal tactic. Figs. 7 and 8 show the 
Table 1. SASs for evaluation

| | Adaptive air condition control system | Adaptive traffic signal controller |
|---|---|---|
| Environment | Temperature and humidity conditions | Car inflow to an intersection |
| Environment complexity | 2 environmental factors (temperature, humidity) | 12 environmental factors (the number of car inflow from 4 source roads to other 3 destination roads: 12 directions) |
| Source of real environmental data | Open weather data portal of South Korea (https://data.kma.go.kr) - 2018 hourly weather data of Seoul | Open traffic data of Daegu, South Korea (https://car.daegu.go.kr) - 2018 Daily&Hourly Traffic data of an intersection in Daegu |
| System | Indoor air condition controller | Traffic signal controller |
| System model | Model of changing indoor temperature and humidity affected by environment conditions and the system’s control | Model of changing the number of waiting cars in the intersection affected by car inflow and traffic signals |
| Sensors | Temperature sensor, humidity sensor | Traffic flow sensors for each of the 12 directions |
| Actuators | Temperature control actuator, humidity control actuator | Traffic lights |
| Adaptation tactic | Temperature control value, humidity control value | Configuration of traffic signal pattern duration |
| Size of the adaptation tactic space | 101 possible control values for each of temperature and humidity by the system capability (-5, -4.9, ..., +4.9, +5 (°C, %)) | 6,188 possible configurations of traffic signal pattern duration (Fig. 6) |
| Adaptation cycle | 1 hour | 1 hour |
| Adaptation goal | Target air condition (25, 50) - following ASHRAE comfort zone [8] | Minimizing the number of waiting cars |
| Tactic evaluation criteria | Average difference between controlled indoor condition and target condition | Average of the number of waiting cars |
| Forecasting method | Random walk model with seasonal differencing [30] | Polynomial regression |


evaluation results for each system. The reported planning time is the average 
of 100 repeated experiments. The adaptation planning time for the PMC-based 
approach is constant, but the time for PASTA increases in proportion to the 
number of samples used for the SMC because the time for a single simulation 
is almost constant. Unfortunately, adaptation planning results for the traffic
signal controller could not be obtained using PMC with 2 GB of memory because
its models and tactics were more complex than those of the air condition control
system and therefore consumed more verification resources. Therefore, for the traffic signal controller,
Fig. 7. Adaptation planning cost - Air condition control system
Fig. 8. Adaptation planning cost - Traffic signal controller


the adaptation planning time for the PMC-based approach could not be measured;
however, the results for both systems indicate that PASTA completes adaptation
planning much faster than the PMC-based approach. The results also confirm that the
adaptation planning time of PASTA is proportional to the number of samples
and grows with the complexity of the adaptation problem.

RQ2: To confirm the similarity of the optimal tactics found by the PASTA and
PMC-based approaches, we compared the optimal tactics that the two approaches
returned in the same situation. To quantify the similarity, we defined two
criteria. If the two tactics were the same, they were defined
as identical, and if they were adjacent in terms of the tactic specifications, they
were defined as similar. For example, for the air condition control system,
temperature control tactics +3°C and +3.1°C are adjacent because the temperature
control unit is 0.1°C based on the system’s capability, and the probability
that two arbitrarily chosen tactics are adjacent is less than 2%. Because the samples
used by SMC are randomly generated, we repeated the PASTA experiments 100
times and report the percentage of identical or similar tactics compared to the
tactic returned by the PMC-based approach. Because the optimal tactic for the
traffic signal controller could not be found using PMC, only the experimental results
of the air condition controller are shown in Fig. 9. PASTA
always found the same or a similar optimal tactic as the PMC-based approach
except when using 10 samples; however, one limitation of utilizing SMC is that,
regardless of how many samples we used, we could not always obtain the
same result as the PMC-based approach, which is considered an oracle.
For this case system, PASTA returned the identical tactic in approximately 50% of
the runs on average.
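
The two similarity criteria and the "less than 2%" figure can be checked with a short Python sketch. It assumes a one-dimensional tactic on a 0.1 grid, as in the temperature example, and counts unordered pairs of distinct values. That is one reading consistent with the stated bound, not necessarily the calculation used for the paper. All function names are hypothetical.

```python
def classify(pasta_tactic, pmc_tactic, step=0.1, tol=1e-9):
    """Label a PASTA result relative to the PMC result (treated as the oracle)."""
    diff = abs(pasta_tactic - pmc_tactic)
    if diff <= tol:
        return "identical"
    if abs(diff - step) <= tol:  # exactly one control unit apart
        return "similar"
    return "different"


def similarity_rates(pasta_tactics, pmc_tactic):
    """Fraction of repeated PASTA runs falling into each category."""
    counts = {"identical": 0, "similar": 0, "different": 0}
    for tactic in pasta_tactics:
        counts[classify(tactic, pmc_tactic)] += 1
    return {label: count / len(pasta_tactics) for label, count in counts.items()}


def adjacency_probability(n_values):
    """Chance that two distinct values from an evenly spaced grid are neighbours."""
    pairs = n_values * (n_values - 1) // 2   # unordered pairs of distinct values
    return (n_values - 1) / pairs            # n_values - 1 neighbouring pairs
```

For 101 control values there are 100 neighbouring pairs among 5,050 possible pairs, i.e. 2/101, or about 1.98%, which is below the 2% stated above.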
Fig. 9. Adaptation planning accuracy - Air condition control system


RQ3: For RQ1 and RQ2, we showed that PASTA can quickly find a sub- 
optimal adaptation tactic that is similar to the PMC-based approach’s result. 
For RQ3, we obtained simulation results to confirm the adaptation performance 
of the PASTA approach in comparison with non-adaptation, reactive adapta- 
tion, and PMC-based proactive adaptation. As shown in Fig. 10, the goal of the 
air condition control system was to keep the temperature at 25°C, and proac- 
tive adaptation approaches showed a better adaptation performance than other 
strategies. In addition, the PASTA and PMC-based approaches exhibited a simi- 
lar performance because PASTA has always made similar adaptation decisions to 
Fig. 10. Adaptation performance - Air condition control system
Fig. 11. Adaptation performance - Traffic signal controller


the PMC-based approach. In Fig. 11, the goal of the traffic signal controller was 
to reduce the number of vehicles waiting at the intersection as much as possible, 
and proactive adaptation using PASTA showed the best performance. These two
results demonstrate that proactive adaptation outperforms reactive adaptation
and that PASTA achieves adaptation performance similar to the PMC-based approach
at a lower verification cost.
Table 2. Comparison of proactive adaptation approaches

| | PMC-based approach | SMC-based approach (PASTA) |
|---|---|---|
| Adaptation cost | Forecasting time; modeling time (relatively high); probabilistic verification time (relatively high) | Forecasting time; sampling time (relatively low); statistical verification (simulation + hypothesis testing) time (relatively low) |
| Adaptation accuracy and performance | Regarded as an oracle (high, limited to the quality of the models) | Provides similar adaptation results to PMC-based adaptation (relatively low, limited to the quality of the samples and models) |
| Pros | An optimal adaptation tactic can be found. | A sub-optimal adaptation tactic can be found with a lower adaptation cost. If the model can be simulated, it is not limited to a particular modeling language. |
| Cons | High adaptation cost is required. Modelling language is dependent on the model checker. | The adaptation result is not fully trustworthy. |
| Proper application | Safety-critical system | Real-time system |


We compared two approaches of proactive adaptation: PMC-based and SMC- 
based (PASTA) approaches. As we confirmed in our evaluation, the two ap- 
proaches have their own advantages and disadvantages, so engineers should 
carefully decide which to choose for their SAS development. We summarized 
our insights regarding their characteristics in Table 2 to guide engineers’
decision making. As we emphasized, the SMC-based approach makes adaptation
decisions faster than the PMC-based approach because it verifies a system’s
adaptation tactics faster. In addition, as long as simulation results can be
generated from the given models, the modeling language is not restricted to
those supported by a model checker; however, an adaptation decision made by the
SMC-based approach may not be globally optimal. Therefore, the SMC-based approach may not be
suitable for some safety-critical systems, and the PMC-based approach could 
be the better choice if the trustworthiness of the system is the most important 
concern. For SASs requiring a lower adaptation cost, such as real-time systems, 
PASTA is more appropriate than the PMC-based approach. 


7 Threats to Validity 


One threat is the selection of the SMC algorithm. We selected SMCS to
demonstrate the adaptation performance when selecting the simplest SMC algorithm.
SMCS is suitable for explicitly showing how SMC-based adaptation costs are affected
by the number of samples, and the other SMC algorithms have similar
characteristics. To reduce this threat, we also implemented SSP and SPRT and compared
them to the PMC-based approach, and both showed similar cost, accuracy, and
performance differences. Therefore, this paper reports only the SMCS results
obtained by varying the number of samples.

Another threat is the implementation of the PMC-based adaptation
approach. We implemented the PMC-based approach directly following [26].
This threat was reduced because the authors published the full structure and
code of the PRISM modules for the implementation of the approach. We
implemented the two case systems according to the PRISM module code shown in
the paper. For a fair comparison, environment, system, and adaptation tactic
spaces of the same complexity were given to both the PMC-based and PASTA
approaches.


8 Conclusion 


We have proposed PASTA, a proactive adaptation approach using SMC that
is an efficient alternative to PMC-based proactive adaptation. We applied the
PASTA approach to two realistic SASs. Through experiments based on actual
data, we confirmed that PASTA makes adaptation decisions similar to those of
the PMC-based proactive adaptation approach in a shorter time. We also
confirmed that its adaptation decisions are more effective in achieving the system’s
goals than non-adaptation and reactive adaptation, and comparable to the PMC-based approach.
Currently, PMC-based approaches are considered the major trend in proactive 
adaptation, but in this paper, we showed that the SMC-based proactive adap- 
tation approach can be an efficient alternative. In addition, the algorithmic pro- 
cesses, reference architecture, and open-source skeleton of PASTA proposed in 
this paper will be of substantial help to developers who wish to apply PASTA to 
SAS development. This study was primarily conducted to validate the PASTA 
approach, but in the future, we plan to study methods such as effective sampling 
and adaptation space reduction for a more effective PASTA approach, and we 
also plan to apply PASTA to actual running systems. 
Abstract. Deep Neural Networks (DNNs) are being deployed in a wide
range of settings today, from safety-critical applications like autonomous 
driving to commercial applications involving image classifications. How- 
ever, recent research has shown that DNNs can be brittle to even slight 
variations of the input data. Therefore, rigorous testing of DNNs has 
gained widespread attention. 

While DNN robustness under norm-bounded perturbations has received significant
attention over the past few years, our knowledge is still limited when it comes
to natural variants of the input images. These natural variants, e.g.,
a rotated or a rainy version of the original input, are especially concerning 
as they can occur naturally in the field without any active adversary and 
may lead to undesirable consequences. Thus, it is important to identify 
the inputs whose small variations may lead to erroneous DNN behaviors. 
The very few studies that looked at DNN’s robustness under natural 
variants, however, focus on estimating the overall robustness of DNNs 
across all the test data rather than localizing such error-producing points. 
This work aims to bridge this gap. 

To this end, we study the local per-input robustness properties of the 
DNNs and leverage those properties to build a white-box (DEEPROBUST- 
W) and a black-box (DEEP RoBusT-B) tool to automatically identify the 
non-robust points. Our evaluation of these methods on three DNN mod- 
els spanning three widely used image classification datasets shows that 
they are effective in flagging points of poor robustness. In particular, 
DEEPRoBustT-W and DEEPRoBuST-B are able to achieve an F1 score 
of up to 91.4% and 99.1%, respectively. We further show that DEEP- 
Rosust-W can be applied to a regression problem in a domain be- 
yond image classification. Our evaluation on three self-driving car mod- 
els demonstrates that DEEPROBUusT-W is effective in identifying points 
of poor robustness with F1 score up to 78.9%. 


Keywords: Deep Neural Networks · Software Testing · Robustness of
DNNs.


1 Introduction 


Deep Neural Networks (DNNs) have achieved an unprecedented level of
performance over the last decade in many sophisticated areas such as image
recognition [38], self-driving cars [5] and playing complex games [65]. These
advances have also motivated companies to adapt their software development
flows to incorporate AI components [3]. This trend has, in turn, spawned a new
area of research within software engineering addressing the quality assurance
of DNN components [11, 20, 32, 36, 40, 42, 55, 57, 73, 74, 91, 92].

Fig. 1: (a)-(d) A well-trained ResNet model [14] misclassifies the rotated
variations of a bird image into three different classes though the original
un-rotated image is classified correctly. (e)-(h) The same model successfully
classifies all the rotated variants of another bird image from the same test
set. The sub-captions consist of rotation degrees and the predicted classes.

Notwithstanding the impressive capabilities of DNNs, recent research has
shown that DNNs can be easily fooled, i.e., made to mispredict, with a little
variation of the input data [14, 23, 73]: either by adding a norm-bound
pixel-level perturbation to the original input [9, 23, 71], or through natural
variants of the inputs, e.g., rotating an image, changing the lighting
conditions, adding fog, etc. [14, 52, 55]. The natural variants are especially
concerning as they can occur naturally in the field without any active
adversary and may lead to serious consequences [73, 92].

While norm-bound perturbation based DNN robustness is relatively well- 
studied, our knowledge of DNN robustness under the natural variations is still 
limited—we do not know which images are more robust than others, what their 
characteristics are, etc. For example, consider Figure 1: although the original 
bird image (a) is predicted correctly by a DNN, its rotated variations in images 
(b)-(d) are mispredicted to three different classes. This makes the original image 
(a) very weak as far as robustness is concerned. In contrast, the bird image 
(e) and all its rotated versions (generated by the same degrees of rotation) in 
Figure 1(f)-(h) are correctly classified. Thus, the original image (e) is quite
robust. It is important to distinguish between such robust vs. non-robust images, 
as the non-robust ones can induce errors with slight natural variations. 

Existing literature, however, focuses on estimating the overall robustness of
DNNs across all the test data [4, 14, 88]. From a traditional software point of
view, this is analogous to estimating how buggy a piece of software is without
actually localizing the bugs. Our current work tries to bridge this gap by
localizing the non-robust points in the input space that pose significant
threats to a DNN model's robustness. However, unlike traditional software where
bug localization is performed in program space, we identify the non-robust
inputs in the data space. As a DNN is a combination of data and architecture,
and the architecture is largely uninterpretable, we restrict our study of
non-robustness to the input space. To this end, we quantify the local (per
input) robustness property of a DNN. First, we treat all the natural variants
of an input image as its neighbors. Then, for each input data point, we consider
a population of its neighbors and measure the fraction of this population
classified correctly by the DNN: a high fraction of correct classifications
indicates good robustness (Figure 1(e)) and vice versa (Figure 1(a)). We term
this measure neighbor accuracy. Using this metric, we study different local
robustness properties of the DNNs and analyze how the weak, a.k.a. non-robust,
points differ characteristically from their robust counterparts. Given that the
number of natural neighbors of an image can be potentially infinite, we first
perform a more controlled analysis by keeping the natural variants limited to
spatially transformed images generated by rotation and translation, following
the previous work [4, 14, 88]. Such controlled experiments help us to explore
different robustness properties while systematically varying transformation
parameters.


Our analysis with three well-known object recognition datasets across three
popular DNN models, i.e., a total of nine DNN-dataset combinations, reveals
several interesting properties of local robustness of a DNN w.r.t. natural variants:


— The neighbors of a weaker point are not necessarily classified into one single
incorrect class. In fact, the weaker the point is, the more diverse its
neighbors' (mis)classifications become.


— The weak points are concentrated towards the class decision boundaries of the 
DNN in the feature space. 


Based on these findings, we further develop two techniques (a black-box and
a white-box) that can localize the points of poor robustness, thereby providing
a means of input-specific, real-time feedback about robustness to the end-user.
Our white-box and black-box detectors can identify weak, a.k.a. non-robust,
points with F1 score up to 91.4% and 99.1%, respectively, at neighbor accuracy
cutoff 0.75. To further check the generalizability of our technique, we also
detect weak points in a self-driving car application where we generate natural
input variants by adding rain and fog. Note that these are more complex image
transformations, and also the model works in a regression setting instead of
classification. These models take an image as input, and output a driving angle.
Our white-box detector can identify weak points with F1 score up to 78.9%.


In summary, we make the following contributions: 


— We conduct an empirical study to understand the local robustness properties 
of DNNs under natural variations. 


— We develop a white-box (DEEPROBUST-W) and a black-box (DEEPROBUST-B)
method to automatically detect weak points.


— We present a detailed evaluation of our methods on three DNN models across 
three image classification datasets. To check the generalizability of our find- 
ings, we further evaluate DEEPROBUST-W in a setting with non-spatial trans- 
formations (i.e., rain and fog), a different task (i.e., regression), and a safety- 
critical application (i.e., self-driving car). We find that DEEPROBUST can 
successfully detect weak points with reasonably good precision and recall. 


— We made our code public at https://github.com/AIasd/DeepRobust. 
2 Background: DNN Testing 


Existing studies have proposed different techniques to generate test data inputs 
by perturbing input images for a DNN and use them to evaluate the robustness 
of the DNN. Depending on how the input image is perturbed, the techniques for 
generating DNN test data can be classified into three broad categories: 

i) Adversarial inputs are typically generated by norm-based perturbation
techniques [9, 23, 39, 46, 53, 85] where some pixels of an input image ($I$) are
perturbed by a norm-based distance ($\ell_1$, $\ell_2$, or $\ell_\infty$) such
that the distance between the perturbed image and $I$ is $< \epsilon$, where
$\epsilon$ is a small positive value. These adversarial examples are used to
expose the security vulnerabilities of DNNs.

ii) Natural variations are generated through a variety of image
transformations, and are used to evaluate the robustness of DNNs under such
variations [13, 14, 73]. Sources of these variations include changes in camera
configuration, or variations in background or ambient conditions. The
transformations simulating these variations could be spatial, such as rotation,
translation, mirroring, shear, and scaling of images, or non-spatial, such as
changes in the brightness or contrast of an image. Here we first focus on
spatial transformations as opposed to adversarial ones for two reasons. First,
compared with adversarial examples, which are fairly contrived, spatial
transformations are more likely to arise in benign environments. Second, using
simple parametric spatial transformations like rotations and translations, it
is easier to systematically explore the local robustness properties. Later, to
emulate more natural variations, we add fog and rain to the images of a
self-driving car dataset and evaluate our method's generalizability.

iii) GAN-based image generation techniques use a Generative Adversarial
Network (GAN) to synthesize images. GAN is one class of generative models trained
as a minimax two-player game between a generative model and a discriminative
model [22]. GAN-based image generation has been successfully used to generate
DNN test data instances [92, 93].

Standard Accuracy vs. Robust Accuracy. Standard accuracy measures how
accurately an ML model predicts the correct classes of the instances in a given
test dataset. Robust accuracy, a.k.a. adversarial accuracy, estimates how
accurately an ML model classifies the generated variants [76]. In this paper,
we adopt a point-wise robust accuracy measure, neighbor accuracy, to quantify
the robustness of a DNN for the neighbors around each data point.


3 Methodology 


3.1 Terminology 


Original Data Point: An original data point represents an original un-modified 
data instance (image in our case) in the studied dataset. The original data points 
can come from training, validation, or testing dataset, depending on the exper- 
imental setting. In Figure 2, the triangle in the center is an original data point. 
Neighbors: Neighbors are images generated by the natural variations, e.g., 
spatial transformations applied to an original image. Since the transformation 
parameters are continuous (e.g., degree of rotations), there can be an infinite 
number of neighbors per image. In Figure 2, the small circles around an original 
data point represent its neighbors. 


Neighbor Accuracy: We define the neighbor accuracy of a data point as the
percentage of its neighbors, including the point itself, that are correctly
classified by the DNN model. Figure 2 illustrates this; here, small red circles
indicate misclassified neighbors, while small green circles are correctly
classified ones. In the figure, each original data point has only five
neighbors. In the left-hand-side diagram, four out of five neighbors are
correctly classified by the given DNN model. If the original data point is
correctly classified as well, the neighbor accuracy of the original data point
is (5/6=) 83.3%. Similarly, in Figure 2 (right), four out of the five neighbors
have been misclassified by the model; if the original data point is
misclassified, the neighbor accuracy is (1/6=) 16.7%.
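
The two Figure 2 cases can be checked with a minimal sketch of the neighbor accuracy computation. The function name and the class labels ("bird", "cat", "dog") are illustrative assumptions, not part of the paper's code.

```python
from typing import Hashable, Sequence


def neighbor_accuracy(
    true_label: Hashable,
    original_prediction: Hashable,
    neighbor_predictions: Sequence[Hashable],
) -> float:
    """Fraction of {original point + its neighbors} that the model labels correctly."""
    predictions = [original_prediction, *neighbor_predictions]
    correct = sum(1 for p in predictions if p == true_label)
    return correct / len(predictions)


# Left-hand diagram: the original point and four of its five neighbors are correct.
left = neighbor_accuracy("bird", "bird", ["bird", "bird", "bird", "bird", "cat"])
# Right-hand diagram: the original point is wrong and only one neighbor is correct.
right = neighbor_accuracy("bird", "cat", ["bird", "cat", "dog", "cat", "dog"])
print(f"left={left:.3f}, right={right:.3f}")  # 5/6 and 1/6
```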


Robustness. An original data point is strong, a.k.a. robust, w.r.t. the DNN
model under test if its neighbor accuracy is higher than a pre-defined
threshold. Conversely, a weak, a.k.a. non-robust, point has a neighbor accuracy
lower than a pre-defined threshold. For example, at a 0.75 neighbor accuracy
threshold, the black triangle in Figure 2 is a strong point, and the grey
triangle is a weak point.

A region contains an original point and all of its neighbors. If the original
point is strong (weak), we call the corresponding region a robust (weak)
region. In Figure 2, the light green region is robust while the light red
region is weak.

[Figure 2 panels: Robust Region, Weak Region]
Fig. 2: Illustrating our terminologies. The triangles are original points, and
the small circles are their neighbors generated by natural variations. The
light-green region is robust with higher neighbor accuracy, while the
light-red region is vulnerable. The corresponding original points are robust
and non-robust accordingly.


Neighbor Diversity: For a multi-class classification task, different neighbors
of an original point can be misclassified to different classes. The Neighbor
Diversity score measures how many diverse classes a point's neighbors are
classified into, and is formally computed using the Simpson Diversity Index
($\lambda$) [67]:

$$\lambda = \sum_{i=1}^{k} p_i^2 \qquad (1)$$

where $k$ is the total number of possible classes and $p_i$ is the probability
of an image's neighbors being predicted to be class $i$. A large Simpson Index
means low diversity. Consider three possible classes A, B, and C, and assume an
image has 4 neighbors. Including the original image, there are 5 images in
total. If two of the five images are classified as A, and the rest are
classified as B, then $\lambda = (2/5)^2 + (3/5)^2 + (0/5)^2 = 0.52$. In
contrast, if two of them are classified as A, two are classified as B, and one
is classified as C, then $\lambda = (2/5)^2 + (2/5)^2 + (1/5)^2 = 0.36$.
Clearly, the latter case is more diverse and thus has a lower $\lambda$ score.
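
The worked example for Equation 1 can be reproduced with a short sketch. The function name is an assumption for illustration; the class probabilities are estimated as empirical frequencies over the original image and its neighbors, as in the example above.

```python
from collections import Counter
from typing import Hashable, Sequence


def simpson_diversity_index(predictions: Sequence[Hashable]) -> float:
    """Sum of squared class frequencies over the predicted labels (Equation 1)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    total = len(predictions)
    counts = Counter(predictions)
    # Classes that never occur contribute (0/total)^2 = 0, so they can be skipped.
    return sum((n / total) ** 2 for n in counts.values())


two_classes = simpson_diversity_index(["A", "A", "B", "B", "B"])
three_classes = simpson_diversity_index(["A", "A", "B", "B", "C"])
print(round(two_classes, 2), round(three_classes, 2))  # the 0.52 and 0.36 cases
```

A lower value means the predictions are spread over more classes, so the second case is the more diverse one.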
Feature Representation: In a DNN, the neurons' outputs in each layer capture
different abstract representations of the raw input, which are commonly known
as features, extracted by the current layer and all the preceding layers. Each
layer's output forms the corresponding feature space. For a given input data
point, we consider the output of the DNN's second-to-last layer as its feature
representation or feature vector.


3.2 Data Collection 


Neighbor Generation: For the image classification tasks, for each original
image point, we generate its neighbors by combining two types of spatial
transformations: rotation and translation. We carefully choose these two types
as representatives of non-linear and linear spatial transformations,
respectively, following Engstrom et al. [14]. In particular, following them, we
generate a neighbor by randomly rotating the original point by
$t \in [-30, 30]$ degrees, shifting it by $dx \in [-3, 3]$ pixels horizontally
(about 10% of the original image's width), and shifting it by $dy \in [-3, 3]$
pixels vertically (about 10% of the original image's height). It should be
noted that for image classification it is standard in the literature
[14, 15, 86] to assume that the transformed image has the same label as the
original one. As the transformation parameters are continuous, there can be
infinitely many neighbors of an original data point. Hence, we sample $m$
neighbors for each original data point. We explore the impact of $m$ in RQ2.
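
As a rough sketch of the sampling step, the snippet below draws m rotation/translation parameter triples in the ranges given above. The uniform distribution, the integer pixel shifts, and all names are assumptions made for illustration; the text only states that the parameters are chosen randomly. Applying each triple to an image would be done with an image library and is not shown.

```python
import random
from typing import List, NamedTuple, Optional


class SpatialTransform(NamedTuple):
    rotation_deg: float  # t in [-30, 30] degrees
    dx: int              # horizontal shift in pixels, in [-3, 3]
    dy: int              # vertical shift in pixels, in [-3, 3]


def sample_neighbor_transforms(m: int, seed: Optional[int] = None) -> List[SpatialTransform]:
    """Draw m random rotation + translation parameter triples (assumed uniform)."""
    rng = random.Random(seed)
    return [
        SpatialTransform(
            rotation_deg=rng.uniform(-30.0, 30.0),
            dx=rng.randint(-3, 3),
            dy=rng.randint(-3, 3),
        )
        for _ in range(m)
    ]


for transform in sample_neighbor_transforms(m=3, seed=0):
    print(transform)
```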

For the self-driving-car task where the model predicts steering angle, for each 
original image point, we generate 50% neighbors with rain effect and the rest 
50% with fog effects. We adopt a widely used self-driving car data augmentation 
package, Automold [60], for adding these effects where we randomly vary the 
degrees of the added effect. For the rain effect, we set "rain_type=heavy" and
everything else as default. For the fog effect, we set everything as default. 
Estimating Neighbor Accuracy: To compute the neighbor accuracy of a
data point for a given DNN model, we first generate its neighbor samples by
applying different transformations (spatial for image classification, and rain
or fog for the self-driving-car application). Then we feed these generated
neighbors into the DNN model and compute the accuracy by comparing the DNN's
output with the label of the original data point. For the self-driving-car
application, we follow the technique described in DeepTest [73]. More
specifically, if the predicted steering angle of the transformed image is
within a threshold of the original image, we consider it correct. This ensures
that any small variations of steering angle are tolerated in the predicted
results. We then compute

$$\text{neighbor accuracy} = \frac{\#\,\text{correct predictions}}{1\,(\text{original point}) + \#\,\text{total neighbors}}$$
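
For the regression setting, the same ratio can be sketched with a tolerance on the steering angle. The function name, the tolerance value, and the choice of reference angle passed in by the caller are assumptions for illustration; the paper's exact threshold is not given in this section.

```python
from typing import Sequence


def regression_neighbor_accuracy(
    reference_angle: float,
    predicted_angles: Sequence[float],
    tolerance: float,
) -> float:
    """Share of predictions (original point + neighbors) within `tolerance` of the reference.

    `predicted_angles` is assumed to hold the model output for the original image
    followed by the outputs for its rain/fog neighbors.
    """
    if not predicted_angles:
        raise ValueError("need the original prediction and its neighbors")
    correct = sum(1 for a in predicted_angles if abs(a - reference_angle) <= tolerance)
    return correct / len(predicted_angles)


# Hypothetical angles: three of the five predictions stay within 0.1 of the reference.
acc = regression_neighbor_accuracy(0.0, [0.01, -0.02, 0.30, 0.04, -0.50], tolerance=0.1)
print(acc)
```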


3.3 Classifying Robust vs. Weak Points 


We propose two methods, DEEPROBUST-W and DEEPROBUST-B, to identify 
whether an unlabeled input is strong or weak w.r.t. a DNN in real time. If a test 
image is identified as a weak point, although it may be classified correctly by 
the pre-trained model, this image is in a vulnerable region where a slight change 
to this image may cause the pre-trained DNN to misclassify the changed input. 


Fig. 3: Workflow of DEEPROBUST-W: (a) workflow - training, (b) workflow - testing


DEEPROBUST-W: White-box Classifier. This is a binary classifier designed
to classify an image (in particular, an image feature vector) as a strong or
weak point. Here, we assume that we have white-box access to the DNN under test
to extract the feature vectors of the input images from the DNN. These feature
vectors are given as inputs to DEEPROBUST-W. Figure 3 shows the workflow.


Training: During training of DEEPROBUST-W, we first feed all the original 
training images and their neighbors to the DNN under test. From the DNN 
outputs, we compute the neighbor accuracy for each data point in the training 
set and label each point strong/weak depending on whether its neighbor accuracy 
is higher/lower than a predefined threshold. For each original data point, we also
extract the output of the DNN’s second-to-last layer as its feature vector. We 
use these vectors as inputs to train DEEPROBUST-W and the outputs are the 
corresponding strong/weak labels. 

Testing: Given a test input, we extract its feature vector by feeding the test 
image to the DNN under test and then feed the extracted feature vector to the 
trained DEEPROBUST-W, which predicts if the input is a strong or weak point. 


DEEPROBUST-B: Black-box Classifier. This is also a binary classifier that is
intended to classify an image as a strong or weak point. However, here the user
does not have white-box access to the DNN under test. Figure 4 shows the
workflow.

Fig. 4: Workflow of DEEPROBUST-B


Given a test input, we first randomly generate some of its neighbors. We then 
query the DNN under test with all these neighbors and compute the diversity 
score, as per Equation 1. If the neighbor diversity score (inversely correlated 
with neighbor diversity) is greater than a given diversity score threshold, the 
given test input is classified as a strong point; otherwise, a weak point. 

Notice that, in this method, we do not need a training step. We only need the 
diversity score threshold, which can be empirically set using a ground-truth data 
set. In particular, we first calculate the neighbor accuracy and diversity score of 
each pre-annotated point. Next, based on a given neighbor accuracy threshold, 
we identify the weak points as the ground truth. The highest diversity score 
among these weak points is chosen as the diversity score threshold. 
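
The black-box procedure above can be sketched in a few lines: calibrate the diversity score threshold on pre-annotated points, then compare a new input's score against it. All names are hypothetical, the ground-truth pairs are made-up illustration values, and treating a neighbor accuracy strictly below the cutoff as weak is an assumption about the boundary case.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def diversity_score(predictions: Sequence[Hashable]) -> float:
    """Simpson index of the predicted labels; a high score means low diversity."""
    total = len(predictions)
    return sum((n / total) ** 2 for n in Counter(predictions).values())


def calibrate_threshold(
    ground_truth: Sequence[Tuple[float, float]], accuracy_cutoff: float
) -> float:
    """Highest diversity score among ground-truth weak points.

    `ground_truth` holds (neighbor_accuracy, diversity_score) pairs.
    """
    weak_scores = [score for acc, score in ground_truth if acc < accuracy_cutoff]
    if not weak_scores:
        raise ValueError("no weak points at this cutoff")
    return max(weak_scores)


def classify_point(neighbor_predictions: Sequence[Hashable], threshold: float) -> str:
    """Strong if the diversity score exceeds the threshold, weak otherwise."""
    return "strong" if diversity_score(neighbor_predictions) > threshold else "weak"


# Made-up annotated points: (neighbor accuracy, diversity score).
annotated = [(1.0, 1.0), (0.9, 0.82), (0.6, 0.52), (0.4, 0.36)]
threshold = calibrate_threshold(annotated, accuracy_cutoff=0.75)
print(threshold, classify_point(["A"] * 5, threshold),
      classify_point(["A", "A", "B", "B", "C"], threshold))
```

Choosing the maximum score over the weak points means every ground-truth weak point falls at or below the threshold, which favors recall on the calibration set.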


Usage Scenario. DEEPROBUST-W/B works in a real-world setting where a
customer/user runs a pre-trained DNN model in real time, which constantly
receives inputs, and wants to test if the prediction of the DNN on a given
input can be trusted. DEEPROBUST-W assumes that the user has white-box access
to the DNN under test and all the training data used to train the DNN.
DEEPROBUST-W leverages the feature vectors and neighbor accuracy of the
training data to train the classifier, which can notify the user if the current
input is a strong point or a weak point. If the input is classified as a strong
point, the user can give more trust to the original DNN's prediction. On the
other hand, if the point is classified as a weak point, the user may want to be
more cautious about the DNN's prediction and conduct additional inspections.

In the black-box setting, DEEPROBUST-B assumes the user does not have
white-box access to the DNN under test. DEEPROBUST-B comes with a small
overhead of transforming the input multiple times to get some neighbors and
querying the DNN under test on them to estimate the diversity score.


4 Experimental Design 


4.1 Study Subjects 


Image Classification. Similar to many existing works [36, 41, 61, 73, 74, 92]
on DNN testing, in this work, we use the image classification application of
DNNs as the basis of our investigation. This is one of the most popular
computer vision tasks, where the model tries to classify the objects in an
image or video.


Datasets: We conduct our experiments on three image classification datasets: 
F-MNIST [87], CIFAR-10 [37], and SVHN [89]. 


— CIFAR-10: consists of 50,000 training and 10,000 testing 32x32 color images.
Each image is one of ten object classes.

— F-MNIST: consists of 60,000 training images and 10,000 testing 28x28 gray- 
scale images. Each image is one of ten fashion product related classes. 

— SVHN: consists of 73,257 training images and 26,032 testing images. Each 
image is a 32x32 color cropped image of house numbers collected from Google 
Street View images. 


Architectures: The popular DNN-based image classifiers are variants of
convolutional neural networks (CNN) [28, 38, 79]. Here we study the following
three architectures for all the three datasets:

— ResN: Following Engstrom et al. [14], we use ResN model with 4 groups of 
residual layers with filter sizes 16, 16, 32, and 64, and 5 residual units each. 

— VGG: We use the same VGG architecture as proposed in [66]. 
— WRN: We use a structure with block type (3, 3) and depth 28 in [90] but
replace the widening factor 10 with 2 for fewer parameters and faster training.

We train all the models from scratch using widely used hyper-parameters and
achieve an acceptable level of validation natural accuracy. When training
models on CIFAR-10, we pre-process the input images with random augmentation
(random translation with $dx, dy \in [-2, 2]$ pixels both horizontally and
vertically), which is a widely used preprocessing step for this dataset. When
training models on the other two datasets, plain images are directly fed into
the models. The natural accuracies and robust accuracies of the models are
shown in Table 1.


Steering Angle Prediction. We further evaluate DEEPROBUST-W in a self-driving
car application to show that it can be applied to a regression task. These
models learn to steer (i.e., predict the steering angle) by taking in visual
inputs from car-mounted cameras that record the driving scene, paired with the
steering angles from a human driver.

Table 1: Study Subjects (values are in percentage)
[Table 1 layout: one column group per dataset (CIFAR-10, SVHN, F-MNIST), each
with models VGG, ResN, and WRN; rows: nat acc¹ and rob acc².]
¹Natural accuracy. ²Robust accuracy is estimated as the average neighbor
accuracy for test data points.

Datasets: We use the dataset by Stocco et al. [68], which is collected by the
authors driving on three tracks of different environments in the Udacity
Simulator [77]. It consists of 37888 central camera training images and 9427
central camera evaluation images. Each image is of size 320x120.


Architectures: We evaluate our method on the three pre-trained DNN models 
used in [68]: NVIDIA DAVE-2 [6], Epoch [2], and Chauffeur [1]. These models 
have been used by many previous testing works on self-driving car [55,68,73]. 


4.2 Evaluation 


Evaluation Metric. We evaluate both DEEPROBUST-W and DEEPROBUST-B
for detecting weak points under twelve and nine different DNN-dataset
combinations, respectively, in terms of precision, recall, and F1 score. Let us
assume that $E$ is the set of weak points detected by our tool and $A$ is the
set of true weak points in the ground truth set. Then the precision and recall
are $\frac{|A \cap E|}{|E|}$ and $\frac{|A \cap E|}{|A|}$, respectively. F1
score is a single accuracy measure that considers both precision and recall,
and is defined as
$\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$.
We perform each experiment for two thresholds of neighbor accuracy that define
strong vs. weak points: 0.75 and 0.50.
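
These three metrics can be computed directly from the detected and ground-truth weak points when both are represented as sets of point identifiers. The function name, the integer identifiers, and the convention of returning 0.0 when a denominator is empty are assumptions for this sketch.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    detected: Set[Hashable], actual: Set[Hashable]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for detected weak points vs. ground-truth weak points."""
    hits = len(detected & actual)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical point ids: 4 detected, 5 truly weak, 3 in common.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```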

Baselines. We compare DEEPROBUST-W and DEEPROBUST-B with two base- 
lines. One naive baseline (denoted random) is randomly selecting the same num- 
ber of points as detected by our proposed method to be weak points. Another 
baseline (denoted top1) is based on prediction confidence score—if the confi- 
dence of a data point is higher than a pre-defined cutoff we call it a strong point, 
weak otherwise. This baseline is based on the intuition that DNNs might not be 
confident enough to predict the weak points. 
5 Results 


In this section, we elaborate on our results. In our preliminary experiments, 
we have two findings regarding neighbor accuracy. First, the neighbor accuracy
varies widely across data points and there is a non-trivial number of points
having relatively low neighbor accuracy. For example, for all the models trained on
CIFAR-10 dataset, 40% of training data and 42% of testing data have neighbor 
accuracy <0.75, and 16% of training data and 20% of testing data have neigh- 
bor accuracy <0.50. These points degrade the aggregated spatial robustness of 
the model. The same finding holds for the other two datasets. Second, the dis- 
tribution of neighbor accuracy for a dataset is similar across different models. 
For CIFAR-10, F-MNIST and SVHN, 60%, 76%, and 81%, respectively, of data 
points have neighbor accuracy change < 0.2 across any two models on the same 
dataset. This implies that a large portion of data points’ neighbor accuracy is 
independent of the model selected. 

The first observation shows that neighbor accuracy is a distinguishable mea- 
sure for local robustness for the datasets and models we study. The second 
observation implies that the properties of points of low neighbor accuracy may 
be similar across models for each dataset. Following these two observations, we 
dive deeper and explore the characteristics of data points with different neighbor 
accuracy in RQ1. We then evaluate the performance of DEEPROBUST-W and 
DEEPROBUST-B which are developed based on the observations from RQ1 in 
RQ2 and RQ3, respectively. Finally, in RQ4, we evaluate the generalizability of 
our method by applying DEEPROBUST-W in a regression task for self-driving 
cars under more complex transformations. 


RQ1. What are the characteristics of the weak points? 

We explore the characteristics of robust vs. non-robust points in their feature 
space. In particular, we check the difference in feature representations between: 
a) robust and non-robust points, and b) points with different degrees of robust- 
ness. 


RQ1a. Given a well-trained model, do the feature representations of robust and
non-robust points vary? In this RQ, we first explore how robust (i.e., strong) 
and non-robust (i.e., weak) data points are distributed in the feature space. 

We apply t-SNE [44], a widely used visualization method, to visualize the
distribution of points of different neighbor accuracy in the representation space 
for all three datasets when using ResN as the classifier. Figure 5 shows the 
visualization of feature vectors from two randomly picked classes with colors 
indicating the neighbor accuracy of each point. The darker a point’s color is, 
the lower its neighbor accuracy is. It is evident that most points of low neighbor 
accuracy tend to be further away from the class center. 

To numerically verify this observation, first, we define a class center $c_k$
for each class $k$ as the median value of the feature vectors of all the points
from class $k$. Thus, if $f_i$ is the feature of a point at the $i^{th}$
dimension and $f_{ik}$ is the median of the $i^{th}$ dimension features for all
the points in class $k$, $c_k$ is defined to be
$(f_{1k}, \ldots, f_{ik}, \ldots, f_{nk})$.

(a) CIFAR-10 (b) F-MNIST (c) SVHN
Fig. 5: The t-SNE plots of data points from two randomly chosen classes
across three datasets using ResNet. Darker color indicates lower neighbor
accuracy.


The reason we take the median rather than the mean is that it is a more
statistically stable measure and is less likely to be heavily influenced by
outliers in the representation space. Then, for every point $p$, we define a
ratio:

$$r^{(p)} = \frac{d^{(p)}_{\text{same\_class}}}{d^{(p)}_{\text{nearest\_other\_class}}}$$

where $d^{(p)}_{\text{same\_class}}$ is the distance of the $p$-th point's
feature vector to its own class center and
$d^{(p)}_{\text{nearest\_other\_class}}$ is the distance of the $p$-th point's
feature vector to the class center of its closest other class. A small
$r^{(p)}$ means that the point $p$ is close to its own class center while far
from other classes, i.e., $p$ is far from the decision boundary. In contrast, a
larger $r^{(p)}$ indicates that the point $p$ is closer to some other classes,
i.e., it is closer to the decision boundary.

We then measure the average $r^{(p)}$ among the weak points (denoted as $r_w$)
and among strong points (denoted as $r_s$) for all three datasets across three
models. Besides, we also calculate the Mann-Whitney-Wilcoxon test [47] and
Cohen's d effect size [10] between the two ratios to test whether the two
ratios indeed have a statistically significant difference and how large the
difference is.

Table 2: Weak and strong points ratio, and Cohen's d effect size

Dataset |      CIFAR-10       |        SVHN         |       F-MNIST
Model   | ResN   WRN    VGG   | ResN   WRN    VGG   | ResN   WRN    VGG
Neighbor Accuracy Cutoff = 0.5
r_w     | 0.915  0.955  1.004 | 1.046  1.103  0.997 | 0.746  0.734  0.976
r_s     | 0.609  0.584  0.975 | 0.294  0.309  0.977 | 0.297  0.293  0.930
d*      | 1.368  1.736  1.163 | 2.077  2.428  1.420 | 1.426  1.312  1.332
Neighbor Accuracy Cutoff = 0.75
r_w     | 0.778  0.796  0.992 | 0.604  0.671  0.983 | 0.516  0.496  0.953
r_s     | 0.588  0.558  0.973 | 0.260  0.274  0.977 | 0.253  0.257  0.918
d*      | 0.786  1.040  0.749 | 0.860  1.111  0.401 | 0.749  0.642  0.937

*Cohen's d effect size of 0.20 = small, 0.50 = medium, 0.80 = large,
1.20 = very large, and 2.0 = huge [10, 59].
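
The class-center ratio and the effect size can be sketched as follows. Euclidean distance and the pooled-standard-deviation form of Cohen's d are assumptions here, since the text does not state which distance or which variant was used; the toy 2-D feature vectors and all names are illustrative only.

```python
import math
from statistics import mean, median, stdev
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def class_centers(
    features: Sequence[Vector], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """Per-class center: the dimension-wise median of that class's feature vectors."""
    by_class: Dict[Hashable, List[Vector]] = {}
    for vec, label in zip(features, labels):
        by_class.setdefault(label, []).append(vec)
    return {lab: [median(dim) for dim in zip(*vecs)] for lab, vecs in by_class.items()}


def boundary_ratio(vec: Vector, label: Hashable, centers: Dict[Hashable, List[float]]) -> float:
    """Distance to own class center divided by distance to the nearest other center."""
    d_same = math.dist(vec, centers[label])
    d_other = min(math.dist(vec, c) for lab, c in centers.items() if lab != label)
    return d_same / d_other  # assumes the point does not sit exactly on another center


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d using a pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Toy features: class "a" clusters near (0, 0), class "b" near (4, 4);
# the last "a" point lies between the two clusters.
features = [(0, 0), (0.2, 0), (0, 0.2), (4, 4), (4.2, 4), (4, 4.2), (1.8, 2.0)]
labels = ["a", "a", "a", "b", "b", "b", "a"]
centers = class_centers(features, labels)
ratio_far = boundary_ratio(features[0], "a", centers)    # deep inside its class
ratio_near = boundary_ratio(features[-1], "a", centers)  # close to the boundary
print(round(ratio_far, 3), round(ratio_near, 3), cohens_d([2, 4, 6], [1, 3, 5]))
```

The point lying between the two clusters gets the larger ratio, matching the reading that a larger ratio indicates proximity to the decision boundary.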

As shown in Table 2, for both neighbor accuracy cutoffs (0.5 and 0.75),
except for one setting, the Cohen's d effect size for every setting is larger
than 0.50, which implies a medium to very large difference. Besides, for every
setting, the
mann-whitney wilocox test value (not shown in the table) is smaller than 1e7®?, 
which implies the difference is indeed statistically significant. 

The visualization and numerical results imply that most weak points are
close to the decision boundaries between classes. Note that a similar
observation was also made by Kim et al. [36] in the case of adversarial
perturbation. In particular, they find that adversarial examples tend to be
closer to class decision boundaries. In contrast, we focus on spatial
robustness and find that spatially non-robust points are closer to decision
boundaries.
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RQ1b. Given a well trained model, do the feature representations of the data 
points vary by their degree of robustness? By analyzing the classifications of 
the neighbors of weak vs. strong points, we observe that the weaker a point is, 
its neighbors are more likely to be classified in different classes. We quantify this 
observation by computing diversity of the outputs a point’s neighbor; We adopt 
Simpson Diversity Index (A) [67] as defined in Equation (1). 
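
A minimal sketch of the diversity computation follows. Equation (1) is not reproduced in this part of the text, so the code assumes the common form of the Simpson index, the sum of squared class proportions, under which identical predictions for all neighbors give a score of 1. The function name is hypothetical.

```python
from collections import Counter


def simpson_index(predicted_labels):
    """Sum of squared class proportions among a point's neighbor predictions."""
    total = len(predicted_labels)
    if total == 0:
        raise ValueError("need at least one prediction")
    counts = Counter(predicted_labels)
    return sum((c / total) ** 2 for c in counts.values())


# All neighbors agree: the score is 1. Mixed predictions push it toward 0.
consistent = simpson_index(["bird"] * 4)
mixed = simpson_index(["bird", "bird", "cat", "dog"])
```

Under this assumed form, lower values mean that the neighbors are spread over more classes.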

Table 3 shows the Spearman correlation between neighbor accuracy and λ on the
three datasets and three models for each. Note that while calculating the
correlation, we remove points with neighbor accuracy 100%, since many points
have 100% neighbor accuracy and tend to bias the Spearman correlation upward;
if we include points with neighbor accuracy 100%, the correlations become even
higher. We notice that for any setting, the Spearman correlation is never lower
than 0.853. This indicates that neighbor accuracy and diversity are highly
correlated with each other. For example, the bird image in Fig. 1a has neighbor
accuracy 0.49 and diversity 0.36, while the bird image in Fig. 1e has neighbor
accuracy 1 and diversity 1. This shows that the classifier tends to be confused
about weak points and mispredicts them into many different classes.

Table 3: Spearman Correlation between Neighbor Accuracy and Simpson Diversity
Index. All coefficients are reported with statistical significance (p < 0.05).

Dataset     |       CIFAR-10        |         SVHN          |        F-MNIST
Model       | ResN  | WRN   | VGG   | ResN  | WRN   | VGG   | ResN  | WRN   | VGG
corr.coeff. | 0.853 | 0.909 | 0.946 | 0.970 | 0.984 | 0.983 | 0.923 | 0.962 | 0.8947
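
The correlation analysis can be illustrated with a small standard-library sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks and drops points with neighbor accuracy 1.0 beforehand, as the text describes. The helper names and the toy pairs are hypothetical (only the first pair echoes the bird example above).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy (neighbor accuracy, diversity) pairs; points with accuracy 1.0 are removed.
pairs = [(0.49, 0.36), (0.60, 0.45), (0.80, 0.70), (1.0, 1.0)]
kept = [(a, d) for a, d in pairs if a < 1.0]
rho = spearman([a for a, _ in kept], [d for _, d in kept])
```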


Result 1: In the representation space, weak points tend to lie towards the
class decision boundary while the strong points lie towards the center. The
weaker an image is, the more the model tends to be confused by it and to
classify its neighbors into more diverse classes.


RQ2. Can we detect the weak points in a white-box setting? 


We explore this RQ using DEEPROBUST-W, as discussed in Section 3.3.
DEEPROBUST-W takes the feature vector of a data point as input and classifies
it as a strong or weak point. We implement DEEPROBUST-W with a simple 4-layer,
fully connected neural network architecture with hidden layer dimensions 1500,
1000, and 500, respectively.
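
To make the architecture concrete, the sketch below lists the layer shapes such a detector would have. It assumes that "4-layer" counts the four fully connected weight layers (input to 1500, 1500 to 1000, 1000 to 500, 500 to output) and that the output has two units (weak and strong). The feature size of 512 is a hypothetical placeholder, since it depends on the model under test.

```python
HIDDEN_DIMS = [1500, 1000, 500]  # hidden layer sizes stated in the text


def layer_shapes(feature_dim, num_outputs=2):
    """Return the (inputs, outputs) size of each fully connected layer."""
    dims = [feature_dim, *HIDDEN_DIMS, num_outputs]
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(shapes):
    """Weights plus biases summed over all fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = layer_shapes(feature_dim=512)  # 512 is an assumed feature size
total_parameters = parameter_count(shapes)
```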

Table 4 shows the result. At 0.75 setting, DEEPROBUST-W has F1 up to 
91.4%, with an average of 76.9%. At 0.50 setting, DEEPROBUST-W detects weak 
points with average F1 of 61.1%, while it can go up to 79.1%. DEEPROBUST-W 
consistently performs significantly better than the baseline methods. 

The top1 baseline has very good precision, since a mis-classified image with low
confidence tends to have very poor local robustness. However, there also exist
many images that are correctly classified with high confidence yet have poor
local robustness. Missing these points leads the top1 baseline to have very poor
recall and thus an even worse F1 than the random baseline. Our method helps by
providing high recall together with decent precision.

Table 4: Performance of DEEPROBUST-W and the baseline methods for predicting
weak points.

dataset  | model | method | f1 (0.75) | tp   | fp   | f1 (0.50) | tp   | fp
CIFAR-10 | ResN  | ours   | 0.79      | …    | …    | 0.581     | …    | …
         |       | top1   | 0.376     | 1218 | 206  | 0.182     | 255  | 120
         |       | random | 0.488     | 2372 | 2236 | 0.233     | 520  | 1445
         | WRN   | ours   | 0.747     | 2901 | 906  | 0.56      | 917  | 610
         |       | top1   | 0.35      | 889  | 222  | 0.183     | 189  | 90
         |       | random | 0.395     | 1534 | 2273 | 0.154     | 261  | 1296
         | VGG   | ours   | 0.654     | 2222 | 938  | 0.493     | 747  | 543
         |       | top1   | 0.439     | 1070 | 153  | 0.266     | 278  | 106
         |       | random | 0.332     | 1127 | 2033 | 0.132     | 200  | 1090
SVHN     | ResN  | ours   | 0.755     | 6814 | 2530 | 0.577     | …    | 674
         |       | top1   | 0.315     | 1665 | 142  | 0.267     | …    | 122
         |       | random | 0.343     | 3095 | 6249 | 0.086     | …    | 1878
         | WRN   | ours   | 0.709     | 5062 | 2143 | 0.582     | …    | …
         |       | top1   | 0.292     | 1238 | 130  | 0.203     | …    | …
         |       | random | 0.28      | 2000 | 5205 | 0.095     | …    | 2230
         | VGG   | ours   | 0.595     | 5214 | 3367 | 0.498     | …    | …
         |       | top1   | 0.172     | 840  | 67   | 0.139     | …    | …
         |       | random | 0.341     | 2986 | 5595 | 0.094     | …    | …
F-MNIST  | ResN  | ours   | 0.914     | 6034 | 873  | 0.791     | …    | 556
         |       | top1   | 0.124     | 428  | 11   | …         | 57   | 7
         |       | random | 0.657     | 4340 | 2567 | …         | …    | 1988
         | WRN   | ours   | 0.896     | 5743 | 652  | 0.76      | 2033 | 641
         |       | top1   | …         | …    | …    | …         | 63   | 8
         |       | random | 0.638     | 4093 | 2302 | 0.281     | 752  | 1922
         | VGG   | ours   | 0.864     | 6348 | 1231 | 0.654     | 1895 | 1082
         |       | top1   | 0.104     | 392  | 5    | 0.028     | 39   | 5
         |       | random | 0.734     | 5393 | 2186 | 0.295     | 854  | 2123

Notice that DEEPROBUST-W's performance depends on the training data selection,
mainly (a) how many weak vs. strong points are used to train the model, and
(b) how many neighbors are generated per point to decide if it is strong/weak.

To investigate (a), we assign a weight to each input point, indicating how
likely it is to be selected to train DEEPROBUST-W. In particular, for an input i,
a weight $q_i := \frac{1 + (1 - n_i)^m \times 100}{1 + 100^m}$ is computed, where
$n_i$ is its neighbor accuracy and m is a configurable parameter; with larger m,
more weak points are sampled and DEEPROBUST-W will be trained with more weak
points, and vice versa.

Table 5A shows the performance: as m increases, the detector trades precision
for recall. In this way, by choosing different values of m, the precision-recall
trade-off of the detector can be adjusted according to a user's need. From a
different perspective, this way of oversampling weak points also addresses the
potential problem of imbalanced data when the weak points are much fewer than
the strong points.
Table 5: DEEPROBUST-W performance using different sampling strategies
for training

A: with varying number of strong/weak points

dataset  | m | prec  | recall | tp   | fp   | f1
CIFAR-10 | 0 | 0.660 | 0.518  | 1290 | 664  | 0.581
         | 1 | 0.615 | 0.599  | 1490 | 932  | 0.607
         | 2 | …     | …      | …    | …    | 0.612
SVHN     | 0 | 0.677 | 0.502  | 1414 | 674  | 0.577
         | 1 | 0.575 | 0.653  | 1837 | 1357 | 0.612
         | 2 | 0.332 | 0.767  | 2160 | 4356 | 0.463
F-MNIST  | 0 | 0.794 | 0.787  | 2144 | 556  | 0.791
         | 1 | 0.746 | 0.839  | 2284 | 777  | 0.79
         | 2 | 0.712 | 0.871  | 2372 | 962  | 0.783

B: with varying number of neighbors

dataset  | #neighbors | prec  | recall | tp   | fp   | f1
CIFAR-10 | 6          | 0.662 | …      | …    | 493  | 0.49
         | 12         | 0.685 | …      | …    | …    | …
         | 25         | 0.665 | …      | …    | 629  | 0.572
         | 50         | 0.660 | 0.518  | 1290 | 664  | 0.581
         | 200        | 0.683 | 0.507  | 1261 | 585  | 0.582
SVHN     | 6          | 0.723 | 0.403  | 1136 | 436  | 0.518
         | 12         | 0.672 | 0.527  | 1483 | 725  | 0.59
         | 25         | 0.619 | 0.629  | 1771 | 1090 | 0.624
         | 50         | 0.632 | 0.605  | 1703 | 993  | 0.618
         | 200        | 0.667 | 0.550  | …    | …    | 0.603
F-MNIST  | 6          | 0.817 | 0.727  | 1981 | 443  | 0.77
         | 12         | 0.784 | 0.790  | 2153 | 592  | 0.787
         | 25         | 0.773 | 0.787  | 2143 | 629  | 0.78
         | 50         | 0.836 | 0.727  | 1981 | 390  | 0.778
         | 200        | 0.778 | 0.812  | 2211 | 632  | 0.794
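
As a worked check of how the columns of Table 5 relate to each other, the sketch below recomputes precision and F1 for the CIFAR-10 row with m = 0 (tp = 1290, fp = 664, recall = 0.518). The helper names are illustrative only.

```python
def precision(tp, fp):
    """Fraction of points flagged as weak that really are weak."""
    return tp / (tp + fp)


def f1_score(prec, recall):
    """Harmonic mean of precision and recall."""
    return 2 * prec * recall / (prec + recall)


# CIFAR-10, m = 0 in Table 5A: tp = 1290, fp = 664, recall = 0.518.
prec = precision(1290, 664)
f1 = f1_score(prec, 0.518)
```

The results round to the reported 0.660 and 0.581, which confirms that f1 in the tables is the harmonic mean of the prec and recall columns.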


Next, we check how DEEPROBUST-W's performance depends on the number of
sampled neighbors, because a data point can potentially have infinitely many
neighbors. Table 5B shows that the number of neighbors does not have much
influence on the performance of the detector once it goes beyond some value
(the F1 score changes by less than 3.5 percentage points between 25 and 200
samples) for all three datasets. Thus, we choose 50 for all of our experiments.
For future work, a statistical bound with confidence intervals for neighbor
accuracy can be estimated by modeling neighbor accuracy using distributions
like the folded normal.

Result 2: DEEPROBUST-W can identify weak points with reasonably high
F1 score: on average 76.9%, at 0.75 neighbor accuracy cut-off.


RQ3. Can we identify the weak points in a black-box setting?

We explore this RQ using DEEPROBUST-B, as discussed in Section 3.3. We
assume only having access to unlabeled testing data and the model under test as
a black-box. To evaluate DEEPROBUST-B, we spatially transform each test input
m times by randomly applying dw ∈ [−30, +30] degrees rotation, dx ∈ [−3, +3]
pixels horizontal translation, and dy ∈ [−3, +3] pixels vertical translation. We
then calculate the output diversity score (λ) based on Equation (1) and rank
the test images based on λ. Finally, we mark the top k images as the potentially
most non-robust points. The parameter k is chosen according to users' needs.

Fig. 6: The Spearman correlation coeff. between diversity score (λ) and
neighbor accuracy, with varying #neighbors (m).

Fig. 7: AUC-ROC curve with neighbor accuracy cutoff at 0.75. The red vertical
line indicates when the diversity score threshold is chosen from training data.

For each test input, DEEPROBUST-B queries the model with m neighbors to
compute λ. Since querying the classifier comes with an overhead, our goal is to
achieve an optimal accuracy with minimal queries (i.e., m). To determine an
optimal m value, we explore the Spearman correlation between diversity score
and neighbor accuracy, with varying m, when running ResN on all three datasets
(see Figure 6). The correlation increases as m increases, since with more
queries λ becomes more accurate, and so does the neighbor accuracy. We notice
that at m = 15, the correlation coefficients across all the experimental
settings reach above 0.8, and the rate of increase begins to slow down
significantly. The results for the other two architectures are highly similar.
Thus, we set m = 15 as default for DEEPROBUST-B.
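
The black-box procedure can be sketched as follows. This is a hedged illustration, not the authors' implementation: `predict` and `transform` are hypothetical callables supplied by the user (the model under test and an image transformation routine), the parameters are assumed to be drawn uniformly from the stated ranges, and the diversity score is assumed to be the sum of squared class proportions, so that the lowest scores mark the most mixed predictions.

```python
import random
from collections import Counter


def sample_transform(rng):
    """Draw one spatial transformation from the ranges given in the text."""
    return (
        rng.uniform(-30, 30),  # rotation in degrees
        rng.uniform(-3, 3),    # horizontal translation in pixels
        rng.uniform(-3, 3),    # vertical translation in pixels
    )


def diversity_score(image, predict, transform, m=15, rng=None):
    """Query the black-box model on m random neighbors; return sum of p_i^2."""
    rng = rng or random.Random(0)
    labels = [predict(transform(image, *sample_transform(rng))) for _ in range(m)]
    counts = Counter(labels)
    return sum((c / m) ** 2 for c in counts.values())


def rank_weakest(images, predict, transform, k, m=15, seed=0):
    """Return indices of the k images whose neighbor predictions are most mixed."""
    rng = random.Random(seed)
    scores = [diversity_score(img, predict, transform, m, rng) for img in images]
    return sorted(range(len(images)), key=lambda i: scores[i])[:k]
```

Each image costs m model queries, which is why the choice of m = 15 above matters for the overhead.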

Next, we evaluate DEEPROBUST-B's performance. We plot AUC-ROC by
changing top-k at m = 15 and compare our method with the random baseline
and the top1 baseline as before. As shown in Figure 7, our method performs much
better than the random baseline. In particular, our proposed method achieves
an AUC higher than 0.87 for all settings when the neighbor accuracy cutoff is
0.5, and higher than 0.97 when the neighbor accuracy cutoff is 0.75.
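
For reference, the area under the ROC curve can be computed directly from a ranking without plotting. The sketch below uses the pairwise formulation (the probability that a randomly chosen weak point receives a higher weakness score than a randomly chosen strong point, with ties counted as one half). It assumes a score where higher means weaker, for example 1 − λ; the function name is hypothetical.

```python
def auc_roc(scores, is_weak):
    """Pairwise AUC: chance that a weak point outranks a strong one."""
    pos = [s for s, w in zip(scores, is_weak) if w]
    neg = [s for s, w in zip(scores, is_weak) if not w]
    if not pos or not neg:
        raise ValueError("need both weak and strong points")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy weakness scores: one weak point (0.2) is ranked below one strong point (0.5).
example_auc = auc_roc([0.9, 0.2, 0.5, 0.1], [True, True, False, False])
```

This quadratic version is meant for illustration; a sort-based variant would be preferable for large test sets.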

Instead of the above ranking-based scheme, DEEPROBUST-B can also be used as
a classifier if a diversity threshold is given (see Section 3.3). Here, we estimate
the threshold using pre-annotated training data.


We evaluate precision and recall of DEEPROBUST-B in the nine DNN-dataset
combinations under neighbor accuracy cutoffs 0.5 and 0.75. Table 6 shows the
result. At the 0.75 setting, DEEPROBUST-B has f1 up to 99.1%, with an average
of 96.5%. At the 0.50 setting, DEEPROBUST-B detects weak points with an average
f1 of 72.9%, while it can go up to 85.7%. It consistently produces better
estimation than the top1 baseline and the random baseline. This shows that our
black-box method can effectively identify weak points.

Table 6: Performance of DEEPROBUST-B and the baseline methods for predicting
weak points.

dataset  | model | method | f1 (0.75) | tp   | fp   | f1 (0.50) | tp   | fp
CIFAR-10 | ResN  | ours   | 0.939     | 4714 | 257  | 0.622     | 1454 | 801
         |       | top1   | 0.376     | 1218 | 206  | 0.182     | 255  | 120
         |       | random | 0.501     | 2516 | 2455 | 0.234     | 549  | 1706
         | WRN   | ours   | 0.938     | 3657 | …    | …         | …    | …
         |       | top1   | 0.35      | 889  | …    | 0.183     | 189  | 90
         |       | random | 0.383     | 1494 | 2334 | 0.182     | 307  | 1283
         | VGG   | ours   | 0.945     | 3397 | 148  | 0.682     | 1087 | 390
         |       | top1   | 0.439     | 1070 | 153  | 0.266     | 278  | 106
         |       | random | 0.36      | 1296 | 2249 | 0.153     | 244  | 1233
SVHN     | ResN  | ours   | 0.956     | 8371 | 365  | 0.67      | 1845 | 858
         |       | top1   | 0.315     | 1665 | 142  | 0.267     | 452  | 122
         |       | random | 0.336     | 2944 | 5792 | 0.102     | 280  | 2423
         | WRN   | ours   | 0.963     | 6827 | 227  | …         | 1602 | …
         |       | top1   | 0.292     | 1238 | 130  | 0.203     | 275  | 85
         |       | random | 0.275     | 1960 | …    | …         | …    | …
         | VGG   | ours   | 0.976     | 8608 | 144  | 0.779     | 2138 | 454
         |       | top1   | 0.172     | 840  | 67   | 0.139     | 221  | 52
         |       | random | 0.339     | 2997 | 5755 | 0.102     | 279  | 2313
F-MNIST  | ResN  | ours   | 0.987     | 6422 | 81   | 0.802     | 2316 | 546
         |       | top1   | …         | …    | …    | …         | …    | …
         |       | random | 0.655     | 4265 | 2238 | 0.289     | 835  | 2027
         | WRN   | ours   | 0.989     | 6246 | 70   | 0.857     | 2297 | 360
         |       | top1   | 0.144     | …    | …    | …         | 63   | 8
         |       | random | 0.631     | 3987 | 2329 | 0.274     | 736  | 1921
         | VGG   | ours   | 0.991     | 7078 | 60   | 0.847     | 2393 | 418
         |       | top1   | 0.104     | 392  | …    | …         | 39   | 5
         |       | random | 0.711     | 5084 | 2054 | 0.277     | 784  | 2027

Note that generating the spatial transformations and querying the model with
them under the black-box setting is fast. Previous black-box methods for
adversarial perturbation work in a similar fashion [26, 51]. For example, using
CIFAR-10 with a batch of size 100, the average transformation+query time for
one image is 0.031 + 0.015 ms. For the other two datasets, the overhead is
similar. Thus, for m = 15 queries, it takes only 0.465 + 0.225 ms, which is a
negligible overhead for most real-world DNN-based vision applications. This
implies that our black-box method can also be used in real time for many
applications.
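
The overhead estimate is a simple scaling of the per-image figures by the number of queries. The sketch below reproduces it, reading the two reported numbers as the transformation time and the query time respectively (an interpretation of the "transformation+query" wording, not something the text states explicitly).

```python
TRANSFORM_MS = 0.031  # per-image figure reported for CIFAR-10
QUERY_MS = 0.015      # per-image figure reported for CIFAR-10
M_QUERIES = 15        # default number of neighbors for the black-box method

total_transform_ms = M_QUERIES * TRANSFORM_MS
total_query_ms = M_QUERIES * QUERY_MS
total_ms = total_transform_ms + total_query_ms
```

Under this reading, the overhead per test input stays below one millisecond.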


Result 3: Given only black-box access to the DNN classifier, DEEPROBUST-B
can identify weak points with f1 scores that are much better than those of
the top1 method or the random method.


RQ4. How generalizable are these findings? 

Local robustness issues also exist in more critical applications such as
self-driving cars. Here we explore more complex transformations, i.e., adding
rain and fog to the driving scenes. As shown in Figure 8, among the correctly
classified data points, a non-trivial portion (45.8%) suffers from low (<0.75)
neighbor accuracy (in the heatmap, a redder point is a weaker point).

Note that, here, we test regression models, which take images of driving 
scenes as inputs and output the corresponding steering angles. 

Let a set of outputs predicted by a DNN be denoted by
$\{\theta_{o1}, \theta_{o2}, ..., \theta_{on}\}$, and the ground truth labels for
the original (unmodified) image points be
$\{\hat{\theta}_1, \hat{\theta}_2, ..., \hat{\theta}_n\}$. If the difference
between the predicted steering angle $\theta_{ti}$ of a transformed image and
the ground truth label $\hat{\theta}_i$ of the original image is above a
threshold, we consider it as incorrect.

The threshold $\lambda\,MSE_{orig}$ is defined following DeepTest [73], with
$MSE_{orig} = \frac{1}{n} \sum_{i=1}^{n} (\theta_{oi} - \hat{\theta}_i)^2$.
MSE is the Mean Square Error between the outputs and the manual labels, and λ
is a positive coefficient that is chosen to reflect a user's tolerance on the
deviation. Note that there is no softmax layer (and thus no confidence score)
in these regression models, so the top1 baseline method cannot be used here.
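
The threshold rule can be illustrated with a short sketch. It assumes that the "difference" compared against the threshold is the squared error between a neighbor's predicted angle and the original label, in the style of DeepTest; the text does not spell this out. All names and the toy steering angles are hypothetical.

```python
def mse(outputs, labels):
    """Mean squared error between model outputs and manual labels."""
    return sum((o - t) ** 2 for o, t in zip(outputs, labels)) / len(labels)


def neighbor_accuracy(neighbor_predictions, label, threshold):
    """Fraction of transformed-image predictions whose squared error is within the threshold."""
    correct = sum((p - label) ** 2 <= threshold for p in neighbor_predictions)
    return correct / len(neighbor_predictions)


# Toy steering angles (made up) for three original images.
original_outputs = [0.10, -0.20, 0.05]
labels = [0.12, -0.25, 0.00]
lam = 3  # the coefficient used for Table 7
threshold = lam * mse(original_outputs, labels)

# Predictions for four transformed versions of the first image (label 0.12).
accuracy = neighbor_accuracy([0.12, 0.15, 0.30, 0.05], 0.12, threshold)
```

A larger coefficient tolerates larger deviations, so more neighbors count as correct and fewer points are labeled weak.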
Table 7 shows the result when λ = 3. At the 0.75 setting, DEEPROBUST-W has an
f1 score up to 78.9%, with an average of 58.2%. At the 0.50 setting,
DEEPROBUST-W detects weak points with an average f1 of 47.9%, while it can go
up to 68.2%. It consistently produces better estimation than the random
baseline under all the settings. It should be noted that our observation is
valid for all the λ used in [73], from λ equal to 1 to 5. This shows that our
proposed method DEEPROBUST-W can be applied to regression problems with more
complex natural transformations.

Fig. 8: The t-SNE plot of correctly classified data points from the
Self-Driving dataset by the epoch model. Data points are colored based on
neighbor accuracy.

Table 7: Performance of DEEPROBUST-W for predicting weak points of
Self-Driving dataset

model     | method | f1 (0.75) | tp   | fp   | f1 (0.50) | tp   | fp
chauffeur | ours   | 0.417     | 555  | 547  | 0.346     | 339  | 384
          | random | 0.146     | 194  | 908  | 0.096     | 94   | 629
epoch     | ours   | 0.789     | 4354 | …    | 0.682     | 2641 | 1127
          | random | 0.586     | 3234 | 2232 | 0.411     | 1592 | 2176
dave2     | ours   | 0.541     | 979  | 471  | 0.409     | 475  | 246
          | random | …         | …    | …    | …         | …    | …

It should also be noted that it is unrealistic to use DEEPROBUST-B for this
task for two reasons. First, it is impractical to try different variations of
an image in real time for a self-driving car, which is a time-sensitive
application. Further, DEEPROBUST-B requires the calculation of the neighbor
diversity score. For a regression problem, the predicted values are
continuous, so there is a very low probability for any two predictions being
equal. Thus, the neighbor diversity score for every data point will be the same
and cannot be used for identifying the weak points.


Result 4: DEEPROBUST-W can detect weak points of a self-driving car 
dataset with f1 score up to 78.9%, with an average of 58.2%, at neighbor 
accuracy cutoff 0.75. 


6 Related Work 


Adversarial examples. Many works focus on generating adversarial examples
to fool DNNs and evaluate their robustness using pixel-based perturbation
[9, 17, 23, 25, 31, 36, 48, 49, 54, 63, 80-83]. Some other papers [14, 15, 86],
like ours, proposed more realistic transformations to generate adversarial
examples. In particular, Engstrom et al. [14] showed that a simple rotation and
translation can fool a DNN-based classifier, and that spatial adversarial
robustness is orthogonal to ℓp-bounded adversarial robustness. However, all
these works estimate the overall robustness of a DNN based on its aggregated
behavior across many data points. In contrast, we analyze the robustness of
individual data points under natural variations and propose methods to detect
weak/strong points automatically.

DNN testing. Many researchers [16, 21, 29, 36, 41, 55, 69, 70, 74, 94] proposed
techniques to test DNNs. For example, Pei et al. [55] proposed an image
transformation based differential testing framework, which can detect erroneous
behavior by comparing the outputs of an input image across multiple DNNs.
Ferit et al. [16] used fault localization methods to identify suspicious neurons
and leveraged those to generate adversarial test cases.


In contrast, others [8, 29, 64, 73, 78, 92, 94] used metamorphic testing, where
the assumption is that the outputs of an original and its transformed image will
be the same under natural transformations. Among them, some use an uncertainty
measure to quantify some types of non-robustness of an input for prioritizing
samples for testing / retraining [8] or generating test cases [78]. We follow a
similar metamorphic property while estimating neighbor accuracy, and our
proposed DEEPROBUST-B also leverages an uncertainty measure. The key differences
are: First, we focus on estimating a model's performance on general natural
variants of an input rather than the input itself or only spatial variants.
Second, we focus on the task of weak points detection rather than prioritizing /
generating test cases. We also give detailed analyses of the properties of
natural variants and propose a feature vector based white-box detection method,
DEEPROBUST-W. Further, we show that our method works across domains (both image
classification and self-driving car controllers) and tasks (both classification
and regression). Other work on uncertainty complements ours in the sense that
we can easily leverage weak points identified by DEEPROBUST-W and DEEPROBUST-B
to prioritize test cases or generate more adversarial cases of natural variants.


Another line of work [18, 19, 27, 33, 34, 58, 72] estimates the confidence of
a DNN's output. For example, [19] leverages thrown-away information from
existing models to measure confidence; [27] shows that other NN properties like
depth, width, weight decay, and batch normalization are important factors
influencing prediction confidence. Although such methods can provide a
confidence measure per input or its adversarial variants, they do not check its
natural robustness property, i.e., how it will behave under natural variations.


DNN verification. There also exists work on verifying properties for a DNN
model [7, 12, 24, 30, 56, 62, 83]. Most of these works focus on verifying
properties on an ℓp-norm-bounded input space. Recently, Balunovic et al. [4]
provided the first verification technique for verifying a data point's
robustness against spatial transformation. However, their technique suffers
from scalability issues.


Robust training. Regular neural network training involves the optimization
of the loss for each data point. Robust training of neural networks minimizes
the largest loss within a bounded region, usually using adversarial
examples [15, 35, 43, 45, 50, 75, 81, 83, 84]. While both robust training
methods and our work generate variants of data points, instead of training a
model with these variants to improve robustness, we use them to estimate the
robustness of unseen data points. The relation between robust retraining and
our work is similar to bug fixing vs. bug detection in traditional software
engineering literature.


7 Threats to Validity


We adopt rotation and translation as transformations for image classification
tasks and rain and fog effects for the self-driving car task. There are many more
natural variations, such as brightness and snow effects. However, rotation and
translation are representative of spatial transformations and are used by many
papers in evaluating the robustness of DNN models [14, 55]. Rain and fog effects
are also widely leveraged in many influential studies on testing self-driving
cars [55, 73, 92].

Besides, for some of the experiments we did not show all the combinations
under both neighbor accuracy cutoffs (i.e., 0.5 and 0.75). However, we note that
the observations are consistent and we did not include them purely because
of space limitations. Another limitation is that for both DEEPROBUST-W and
DEEPROBUST-B, we need to decide the number of neighbors to use for training a
classifier and estimating λ, respectively. We mitigate it by selecting the
neighbor numbers that give stable performance in terms of precision and recall.


8 Conclusion and Future Work 


In this work, we incorporate data characteristics into the robustness testing
of DNN models. We adopt the concept of neighbor accuracy as a measure for the
local robustness of a data point on a given model. We explore the properties of
neighbor accuracy and find that weak points are often located towards the
corresponding class boundaries and that their transformed versions tend to be
predicted as more diverse classes. Leveraging these observations, we propose a
white-box method and a black-box method to identify weak/strong points to warn
a user about potential weakness in the given trained model in real time. We
design, implement and evaluate our proposed framework, DEEPROBUST-W and
DEEPROBUST-B, on three image recognition datasets and one self-driving car
dataset (for DEEPROBUST-W only) with three models for each. The results show
that they can effectively identify weak/strong points with high precision and
recall.

For future work, other consistency analysis methods [18], e.g., variation ratio
and entropy, can be tried. We can potentially attain a statistical guarantee for
our black-box method by modeling the neighbor accuracy distribution and assuming
a certain level of correlation between neighbor accuracy and complexity score.
Besides, other definitions of robustness like consistency can be explored. We can
also leverage ideas from [8, 78] to easily prioritize test cases or generate
harder test cases based on identified weak points. Further, we can potentially
modify existing fixing methods such as [20] to target the weak points and fix
them.


Abstract. This report describes Test-Comp 2021, the 3rd edition of the
Competition on Software Testing. The competition is a series of annual 
comparative evaluations of fully automatic software test generators for C 
programs. The competition has a strong focus on reproducibility of its 
results and its main goal is to provide an overview of the current state 
of the art in the area of automatic test-generation. The competition was 
based on 3173 test-generation tasks for C programs. Each test-generation 
task consisted of a program and a test specification (error coverage, 
branch coverage). Test-Comp 2021 had 11 participating test generators 
from 6 countries.


1 Introduction


Among several other objectives, the Competition on Software Testing (Test-Comp
[4, 5, 6], https://test-comp.sosy-lab.org/2021) showcases every year the state
of the art in the area of automatic software testing. This edition of Test-Comp 
is the 3rd edition of the competition. It provides an overview of the currently 
achieved results by tool implementations that are based on the most recent ideas, 
concepts, and algorithms for fully automatic test generation. This competition 
report describes the (updated) rules and definitions, presents the competition 
results, and discusses some interesting facts about the execution of the competition 
experiments. The setup of Test-Comp is similar to SV-COMP [8], in terms 
of both technical and procedural organization. The results are collected via 
BENCHEXEC’s XML results format [16], and transformed into tables and plots 
in several formats (https://test-comp.sosy-lab.org/2021/results/). All results are
available in artifacts at Zenodo (Table 3). 


This report extends previous reports on Test-Comp [4, 5, 6]. 
Reproduction packages are available on Zenodo (see Table 3). 
Funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).
dirk.beyer@sosy-lab.org
© The Author(s) 2021 


E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 341-357, 2021. 
https://doi.org/10.1007/978-3-030-71500-7_17


Competition Goals. In summary, the goals of Test-Comp are the following [5]:


• Establish standards for software test generation. This means, most
  prominently, to develop a standard for marking input values in programs, define
  an exchange format for test suites, agree on a specification language for
  test-coverage criteria, and define how to validate the resulting test suites.

• Establish a set of benchmarks for software testing in the community. This
  means to create and maintain a set of programs together with coverage
  criteria, and to make those publicly available for researchers to be used in
  performance comparisons when evaluating a new technique.

• Provide an overview of available tools for test-case generation and a snapshot
  of the state-of-the-art in software testing to the community. This means to
  compare, independently from particular paper projects and specific techniques,
  different test generators in terms of effectiveness and performance.

• Increase the visibility and credits that tool developers receive. This means
  to provide a forum for presentation of tools and discussion of the latest
  technologies, and to give the participants the opportunity to publish about
  the development work that they have done.

• Educate PhD students and other participants on how to set up performance
  experiments, package tools in a way that supports reproduction, and how to
  perform robust and accurate research experiments.

• Provide resources to development teams that do not have sufficient computing
  resources and give them the opportunity to obtain results from experiments
  on large benchmark sets.


Related Competitions. In the field of formal methods, competitions are re- 
spected as an important evaluation method and there are many competitions [2]. 
We refer to the previous report [5] for a more detailed discussion and give here 
only the references to the most related competitions [2, 8, 32, 39].


Quick Summary of Changes. As the competition continuously improves, 
we report the changes since the last report. We list a summary of five new 
items in Test-Comp 2021 as overview: 


• Extended task-definition format, version 2.0: Sect. 2

• SPDX identification of licenses in SV-Benchmarks collection: Sect. 2

• Extension of the SV-Benchmarks collection by several categories: Sect. 3

• Elimination of competition-specific functions __VERIFIER_error and
  __VERIFIER_assume from the test-generation tasks (and rules): Sect. 3

• COVERITEAM: New tool that can be used to remotely execute test-generation
  runs on the competition machines: Sect. 4


2 Definitions, Formats, and Rules 


Organizational aspects such as the classification (automatic, off-site, reproducible,
jury, training) and the competition schedule are given in the initial
competition definition [4]. In the following, we repeat some important definitions
that are necessary to understand the results.

[Figure: boxes labeled Program under Test, Test Specification, Test Generator,
Test Suite (Test Cases), Test Validator, Coverage Statistics]

Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [5])


Test-Generation Task. A test-generation task is a pair of an input program 
(program under test) and a test specification. A test-generation run is a non- 
interactive execution of a test generator on a single test-generation task, in 
order to generate a test suite according to the test specification. A test suite 
is a sequence of test cases, given as a directory of files according to the
format for exchangeable test-suites.¹


Execution of a Test Generator. Figure 1 illustrates the process of executing 
one test generator on the benchmark suite. One test run for a test generator gets 
as input (i) a program from the benchmark suite and (ii) a test specification 
(cover bug, or cover branches), and returns as output a test suite (i.e., a set of 
test cases). The test generator is contributed by a competition participant as 
a software archive in ZIP format. The test runs are executed centrally by the 
competition organizer. The test-suite validator takes as input the test suite from 
the test generator and validates it by executing the program on all test cases: 
for bug finding it checks if the bug is exposed and for coverage it reports the 
coverage. We use the tool TESTCOV [15]² as test-suite validator.


Test Specification. The specification for testing a program is given to the
test generator as input file (either properties/coverage-error-call.prp or
properties/coverage-branches.prp for Test-Comp 2021).

The definition init(main()) is used to define the initial states of the
program under test by a call of function main (with no parameters). The
definition FQL(f) specifies that coverage definition f should be achieved. The
FQL (FSHELL query language [28]) coverage definition COVER EDGES(@DECISIONEDGE)
means that all branches should be covered (typically used to obtain a
standard test suite for quality assurance) and COVER EDGES(@CALL(foo))
means that a call (at least one) to function foo should be covered
(typically used for bug finding). A complete specification looks as follows:
COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) ).
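
How the pieces of such a specification fit together can be shown with plain string composition. The sketch below is illustrative only: the helper names are hypothetical, it performs no validation, and it simply reproduces the textual form shown above for the two coverage goals.

```python
def call_coverage(function_name):
    """FQL formula asking for at least one covered call to the given function."""
    return f"COVER EDGES(@CALL({function_name}))"


def coverage_spec(fql_formula, entry="main"):
    """Compose a specification line in the form shown in the text."""
    return f"COVER( init({entry}()), FQL({fql_formula}) )"


BRANCH_COVERAGE = "COVER EDGES(@DECISIONEDGE)"

branch_spec = coverage_spec(BRANCH_COVERAGE)
error_spec = coverage_spec(call_coverage("reach_error"))
```

The first string matches the complete branch-coverage specification quoted above; the second is the corresponding form for covering a call to reach_error.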


¹ https://gitlab.com/sosy-lab/software/test-format/
² https://gitlab.com/sosy-lab/software/test-suite-validator

Table 1: Coverage specifications used in Test-Comp 2021 (similar to 2019, 2020)

Formula                            Interpretation

COVER EDGES(@CALL(reach_error))    The test suite contains at least one test
                                   that executes function reach_error.

COVER EDGES(@DECISIONEDGE)         The test suite contains tests such that
                                   all branches of the program are executed.


    format_version: '2.0'

    # old file name: floppy_true-unreach-call_true-valid-memsafety.i.cil.c
    input_files: 'floppy.i.cil-3.c'

    properties:
      - property_file: ../properties/unreach-call.prp
        expected_verdict: true
      - property_file: ../properties/valid-memsafety.prp
        expected_verdict: false
        subproperty: valid-memtrack
      - property_file: ../properties/coverage-branches.prp

    options:
      language: C
      data_model: ILP32

Fig. 2: Example task definition file floppy.i.cil-3.yml for C program
floppy.i.cil-3.c (format version and options are new compared to last year)
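
To illustrate how a tool might consume such a task definition, the sketch below writes the content of Fig. 2 as the dictionary a YAML parser would typically produce (no parser is used, so only the standard library is needed). The helper names are hypothetical, and the rule that a property without an expected verdict is a coverage goal is an assumption that holds for this example rather than a statement of the format's semantics.

```python
# Fig. 2 written out as a plain dictionary (what a YAML parser would yield).
task_definition = {
    "format_version": "2.0",
    "input_files": "floppy.i.cil-3.c",
    "properties": [
        {"property_file": "../properties/unreach-call.prp", "expected_verdict": True},
        {
            "property_file": "../properties/valid-memsafety.prp",
            "expected_verdict": False,
            "subproperty": "valid-memtrack",
        },
        {"property_file": "../properties/coverage-branches.prp"},
    ],
    "options": {"language": "C", "data_model": "ILP32"},
}


def coverage_properties(task):
    """Property files without an expected verdict (coverage goals in this example)."""
    return [p["property_file"] for p in task["properties"] if "expected_verdict" not in p]


def pointer_bits(task):
    """Pointer width implied by the data model (ILP32: 32 bits, LP64: 64 bits)."""
    return {"ILP32": 32, "LP64": 64}[task["options"]["data_model"]]


goals = coverage_properties(task_definition)
bits = pointer_bits(task_definition)
```

Carrying the language and data model inside the task definition lets a test generator configure itself per task instead of relying on separate category-level configuration.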


Table 1 lists the two FQL formulas that are used in test specifications of 
Test-Comp 2021; there was no change from 2020 (except that special function 
__VERIFIER_error does not exist anymore). 


Task-Definition Format 2.0. The format for the task definitions in the
SV-Benchmarks repository was extended by options that can carry information
from the test-generation task to the test tool. Test-Comp 2021 used the
format in version 2.0
(https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0).
The options now contain the language (C or Java) and the data model (ILP32,
LP64, see http://www.unix.org/whitepapers/64bit.html, only for C programs)
that the program of the test-generation task assumes
(https://github.com/sosy-lab/sv-benchmarks#task-definitions). An example task
definition is provided in Fig. 2: This YAML file specifies, for the C program
floppy.i.cil-3.c, two verification tasks (reachability of a function call
and memory safety) and one test-generation task (coverage of all branches).
Previously, the options for language and data model were defined in
category-specific configuration files (for example
c/ReachSafety-ControlFlow.cfg), which were deleted before Test-Comp 2021.
License and Qualification. The license of each participating test generator
must allow its free use for reproduction of the competition results. Details
on qualification criteria can be found in the competition report of
Test-Comp 2019 [6]. Furthermore, the community tries to apply the SPDX
standard (https://spdx.dev) to the SV-Benchmarks repository.
Continuous-integration checks based on REUSE (https://reuse.software) will
ensure that all benchmark tasks adhere to the standard.


3 Categories and Scoring Schema 


Benchmark Programs. The input programs were taken from the largest and
most diverse open-source repository of software-verification and
test-generation tasks³, which is also used by SV-COMP [8]. As in 2020, we
selected all programs for which the following properties were satisfied (see
issue on GitHub⁴ and report [6]):

1. compiles with gcc, if a harness for the special methods⁵ is provided,
2. should contain at least one call to a nondeterministic function,
3. does not rely on nondeterministic pointers,
4. does not have expected result ‘false’ for property ‘termination’, and
5. has expected result ‘false’ for property ‘unreach-call’ (only for category
   Error Coverage).


This selection yielded a total of 3173 test-generation tasks, namely 607 tasks 
for category Error Coverage and 2566 tasks for category Code Coverage. The 
test-generation tasks are partitioned into categories, which are listed in
Tables 6 and 7 and described in detail on the competition web site.⁶ Figure 3
illustrates the category composition. 

The programs in the benchmark collection contained functions 
__VERIFIER_error and __VERIFIER_assume that had a specific prede- 
fined meaning. Last year, those functions were removed from all programs 
in the SV-Benchmarks collection. More about the reasoning is explained 
in the SV-COMP 2021 competition report [8]. 


Category Error-Coverage. The first category evaluates the ability to
discover bugs. The benchmark set consists of programs that contain a bug.
Every run is started by a batch script, which produces for every tool and
every test-generation task one of the following scores: 1 point, if the
validator succeeds in executing the program under test on a generated test
case that explores the bug (i.e., the specified function was called), and
0 points, otherwise.


3 https://github.com/sosy-lab/sv-benchmarks

4 https://github.com/sosy-lab/sv-benchmarks/pull/774

5 https://test-comp.sosy-lab.org/2021/rules.php

6 https://test-comp.sosy-lab.org/2021/benchmarks.php
Fig. 3: Category structure for Test-Comp 2021; compared to Test-Comp 2020,
there are three new sub-categories in Cover-Error and two new sub-categories 
in Cover-Branches: we added the sub-categories XCSP, BusyBox-MemSafety, 
and DeviceDriversLinux64-ReachSafety to category Cover-Error, and the sub- 
categories XCSP and Combinations to category Cover-Branches 


Category Branch-Coverage. The second category is to cover as many branches 
of the program as possible. The coverage criterion was chosen because many 
test generators support this standard criterion by default. Other coverage cri- 
teria can be reduced to branch coverage by transformation [27]. Every run 
will be started by a batch script, which produces for every tool and every 
test-generation task the coverage of branches of the program (as reported by
TestCov [15]; a value between 0 and 1) that are executed for the generated 
test cases. The score is the returned coverage. 
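The two per-task scoring rules can be summarized in a minimal sketch. The function names are hypothetical; in the competition the verdicts and coverage values come from the validator TestCov, whereas here they are simply passed in as arguments.

```python
def error_coverage_score(bug_reached: bool) -> int:
    """Error-Coverage: 1 point if a generated test reaches the specified
    function call, 0 points otherwise."""
    return 1 if bug_reached else 0


def branch_coverage_score(covered_branches: int, total_branches: int) -> float:
    """Branch-Coverage: fraction of executed branches, a value in [0, 1]."""
    if total_branches <= 0:
        # Assumption of this sketch: tasks without branches are rejected.
        raise ValueError("program must have at least one branch")
    return covered_branches / total_branches


def category_score(task_scores):
    """Score of one tool in one category: sum of its per-task scores."""
    return sum(task_scores)


# One bug found, one bug missed, and 3 of 4 branches covered.
print(category_score([error_coverage_score(True),
                      error_coverage_score(False),
                      branch_coverage_score(3, 4)]))
```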


Ranking. The ranking was decided based on the sum of points (normalized for 
meta categories). In case of a tie, the ranking was decided based on the run time, 
which is the total CPU time over all test-generation tasks. Opt-out from categories 
was possible and scores for categories were normalized based on the number of 
tasks per category (see competition report of SV-COMP 2013 [3], page 597). 
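The ranking rule can be sketched as follows. The normalization shown here assumes the scheme referenced in [3] works as follows: each sub-category score is divided by the number of tasks in that sub-category, the normalized scores are summed, and the sum is multiplied by the average number of tasks per sub-category. Function names and the example numbers are invented for illustration.

```python
def normalized_meta_score(sub_categories):
    """sub_categories: list of (score, number_of_tasks) pairs.

    Assumed scheme: sum of score/tasks over all sub-categories, scaled by
    the average number of tasks per sub-category.
    """
    average_tasks = sum(tasks for _, tasks in sub_categories) / len(sub_categories)
    return sum(score / tasks for score, tasks in sub_categories) * average_tasks


def ranking(results):
    """results: dict mapping tool name to (score, cpu_time).

    Higher score ranks first; a tie is broken by lower total CPU time.
    """
    return sorted(results, key=lambda tool: (-results[tool][0], results[tool][1]))


# Hypothetical tools: A and B tie on score, B used less CPU time.
print(ranking({"A": (10, 5.0), "B": (10, 3.0), "C": (12, 9.0)}))
print(normalized_meta_score([(50, 100), (10, 20)]))
```

The normalization prevents a sub-category with many tasks from dominating the score of a meta category.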


4 Reproducibility 


In order to support independent reproduction of the Test-Comp results, we 
made all major components that are used for the competition available in public 
version-control repositories. An overview of the components that contribute to 
the reproducible setup of Test-Comp is provided in Fig. 4, and the details are 
given in Table 2. We refer to the report of Test-Comp 2019 [6] for a thorough 
description of all components of the Test-Comp organization and how we ensure 
that all parts are publicly available for maximal reproducibility. 

In order to guarantee long-term availability and immutability of the
test-generation tasks, the produced competition results, and the produced
test suites, we also packaged the material and published it at Zenodo (see
Table 3). The archive for the competition results includes the raw results
in BenchExec’s XML exchange format, the log output of the test generators
and validator, and a mapping from file names to SHA-256 hashes. The hashes
of the files are useful for validating the exact contents of a file, and
accessing the files inside the archive that contains the test suites.
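Validating a file against a recorded SHA-256 hash can be done with a few lines of standard-library code. This is a generic sketch; the function names are hypothetical, and the layout of the mapping file inside the archive is not assumed here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_hex):
    """Check that the file content matches the hash recorded for it."""
    return sha256_of_file(path) == recorded_hex.lower()
```

Reading in chunks keeps memory use constant, which matters for large result archives.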

To provide transparent access to the exact versions of the test generators that
were used in the competition, all test-generator archives are stored in a public
Git repository. GitLab was used to host the repository for the test-generator
archives due to its generous repository size limit of 10 GB.


Competition Workflow. As illustrated in Fig. 4, the ingredients for a test or 
verification run are (a) a test or verification task (which program and which 
specification to use), (b) a benchmark definition (which categories and which 
options to use), (c) a tool-info module (uniform way to access a tool’s version 
string and the command line to invoke), and (d) an archive that contains all 
executables that are required and cannot be installed as a standard Ubuntu package.

(a) Each test or verification task is defined by a task-definition file (as shown, 
e.g., in Fig. 2). The tasks are stored in the SV-Benchmarks repository and 
maintained by the verification and testing community, including the competition 
participants and the competition organizer. 

(b) A benchmark definition defines the choices of the participating team, that 
is, which categories to execute the test generator on and which parameters to 
pass to the test generator. The benchmark definition also specifies the resource 
limits of the competition runs (CPU time, memory, CPU cores). The benchmark 
definitions are created or maintained by the teams and the organizer. 
[Figure 4 diagram: (a) Test-Generation Tasks, (b) Benchmark Definitions,
(c) Tool-Info Modules, and (d) Tester Archives are the inputs of
(e) Test-Generation Run, which produces (f) Test Suite]

Fig. 4: Benchmarking components of Test-Comp and competition’s execution flow
(same as for Test-Comp 2020)


Table 2: Publicly available components for reproducing Test-Comp 2021


Component                Fig. 4  Repository                                   Version

Test-Generation Tasks    (a)     github.com/sosy-lab/sv-benchmarks            testcomp21
Benchmark Definitions    (b)     gitlab.com/sosy-lab/test-comp/bench-defs     testcomp21
Tool-Info Modules        (c)     github.com/sosy-lab/benchexec                3.6
Test-Generator Archives  (d)     gitlab.com/sosy-lab/test-comp/archives-2021  testcomp21
Benchmarking             (e)     github.com/sosy-lab/benchexec                3.6
Test-Suite Format        (f)     gitlab.com/sosy-lab/software/test-format     testcomp21


Table 3: Artifacts published for Test-Comp 2021 


Content DOI Reference 


Test-Generation Tasks 10.5281/zenodo.4459132 [9] 
Competition Results 10.5281/zenodo.4459470 [7] 
Test Suites (Witnesses) 10.5281/zenodo.4459466 [10] 
BenchExec 10.5281/zenodo.4317433 [43] 


(c) A tool-info module is a component that provides a uniform way to
access the test-generation or verification tool: it provides interfaces for
accessing the version string of a test generator and assembles the command
line from the information given in the benchmark definition and task
definition. The tool-info modules are written by the participating teams with
the help of the BenchExec maintainer and others.


(d) A test generator is provided as an archive in ZIP format. The archive 
contains a directory with a README and LICENSE file as well as all components 
that are necessary for the test generator to be executed. This archive is created by 
the participating team and merged into the central repository via a merge request. 


All above components are reviewed by the competition jury and improved 
according to the comments from the reviewers by the teams and the organizer. 
Table 4: Competition candidates with tool references and representing jury members


Tester        Ref.     Jury member             Affiliation

CMA-ES Fuzz   [33]     Gidon Ernst             LMU Munich, Germany
CoVeriTest    [12,31]  Marie-Christine Jakobs  TU Darmstadt, Germany
FuSeBMC       [1,25]   Kaled Alshmrany         U. of Manchester, UK
HybridTiger   [18,38]  Sebastian Ruland        TU Darmstadt, Germany
KLEE          [19,20]  Martin Nowack           Imperial College London, UK
Legion        [37]     Dongge Liu              U. of Melbourne, Australia
LibKluzzer    [35]     Hoang M. Le             U. of Bremen, Germany
PRTest        [14,36]  Thomas Lemberger        LMU Munich, Germany
Symbiotic     [21,22]  Marek Chalupa           Masaryk U., Brno, Czechia
TracerX       [29,30]  Joxan Jaffar            National U. of Singapore, Singapore
VeriFuzz      [23]     Raveendra Kumar M.      Tata Consultancy Services, India


Due to the reproducibility requirements and high level of automation that
is necessary for a competition like Test-Comp, participating in the
competition is also a challenge itself: package the tool, provide meaningful
log output, specify the benchmark definition, implement a tool-info module,
and troubleshoot in case of problems. Test-Comp is a friendly and helpful
community, and problems are reported in a GitLab issue tracker, where the
organizer and the other teams help fix the problems.

To provide participants access to the actual competition machines, the
competition used CoVeriTeam [13]
(https://gitlab.com/sosy-lab/software/coveriteam/) for the first time.
CoVeriTeam is a tool for cooperative verification, which enables remote
execution of test-generation or verification runs directly on the
competition machines (among its many other features). This possibility was
found to be a valuable service for troubleshooting.


5 Results and Discussion 


For the third time, the competition experiments represent the state of the 
art in fully automatic test generation for whole C programs. The report helps 
in understanding the improvements compared to last year, in terms of effec- 
tiveness (test coverage, as accumulated in the score) and efficiency (resource 
consumption in terms of CPU time). All results mentioned in this article were 
inspected and approved by the participants. 


Participating Test Generators. Table 4 provides an overview of the
participating test generators and references to publications, as well as the
team representatives of the jury of Test-Comp 2021. (The competition jury
consists of the chair and one member of each participating team.) Table 5
lists the features and technologies that are used in the test generators. An
online table with information about all participating systems is provided on
the competition web site.⁷


7 https://test-comp.sosy-lab.org/2021/systems.php
Table 5: Technologies and features that the competition candidates used

Computing Resources. The computing environment and the resource limits were
the same as for Test-Comp 2020 [5]: Each test run was limited to
8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The
test-suite validation was limited to 2 processing units, 7 GB of memory, and
5 min of CPU time. The machines for running the experiments are part of a
compute cluster that consists of 168 machines; each test-generation run was
executed on an otherwise completely unloaded, dedicated machine, in order to
achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5
CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM,
and a GNU/Linux operating system (x86_64-linux, Ubuntu 20.04 with Linux
kernel 5.4). We used BenchExec [16] to measure and control computing
resources (CPU time, memory, CPU energy) and VerifierCloud⁸ to distribute,
install, run, and clean-up test-case generation runs, and to collect the
results. The values for time and energy are accumulated over all cores of
the CPU. To measure the CPU energy, we use CPU Energy Meter [17] (integrated
in BenchExec [16]). Further technical parameters of the competition machines
are available in the repository which also contains the benchmark
definitions.⁹


8 https://vcloud.sosy-lab.org

Table 6: Quantitative overview over all results; empty cells mark opt-outs; label ‘new’
indicates first-time participants

One complete test-generation execution of the competition consisted of 
34903 single test-generation runs. The total CPU time was 220 days and the 
consumed energy 56 kWh for one complete competition run for test generation 
(without validation). Test-suite validation consisted of 34903 single test-suite 
validation runs. The total consumed CPU time was 6.3 days. Each tool was 
executed several times, in order to make sure no installation issues occur
during the execution. Including preruns, the infrastructure managed a total
of 210632 test-generation runs (consuming 1.8 years of CPU time) and 207459
test-suite validation runs (consuming 27 days of CPU time). We did not
measure the CPU energy during preruns.
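The reported totals can be cross-checked with simple arithmetic. The sketch below uses only figures from this report (11 test generators, 3173 tasks, 220 days of CPU time, 56 kWh); since the reported totals are rounded, the derived averages are rough estimates rather than measured values.

```python
TEST_GENERATORS = 11      # participants in category Overall
TASKS = 3173              # test-generation tasks in Test-Comp 2021
RUNS = TEST_GENERATORS * TASKS

CPU_TIME_DAYS = 220       # total CPU time for test generation (rounded)
ENERGY_KWH = 56           # total CPU energy for test generation (rounded)

# Average resource consumption of a single test-generation run.
avg_cpu_minutes = CPU_TIME_DAYS * 24 * 60 / RUNS
avg_energy_wh = ENERGY_KWH * 1000 / RUNS

print(RUNS)
print(round(avg_cpu_minutes, 1), "min per run")
print(round(avg_energy_wh, 1), "Wh per run")
```

The product of tools and tasks matches the 34903 runs stated above, and the average of roughly 9 min per run lies below the limit of 15 min of CPU time per run.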


Quantitative Results. Table 6 presents the quantitative overview of all tools
and all categories. The head row mentions the category and the number of
test-generation tasks in that category. The tools are listed in alphabetical
order; every table row lists the scores of one test generator. We indicate
the top three candidates by formatting their scores in bold face and in
larger font size. An empty table cell means that the test generator opted-out
from the respective main category (perhaps participating in subcategories
only, restricting the evaluation to a specific topic). More information
(including interactive tables, quantile plots for every category, and also
the raw data in XML format) is available on the competition web site¹⁰ and
in the results artifact (see Table 3). Table 7 reports the top three test
generators for each category. The consumed run time (column ‘CPU Time’) is
given in hours and the consumed energy (column ‘CPU Energy’) is given in kWh.


9 https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp21

Table 7: Overview of the top-three test generators for each category (measurement
values for CPU time and energy rounded to two significant digits)


Rank  Tester       Score  CPU Time  CPU Energy
                          (in h)    (in kWh)

Cover-Error
1     FuSeBMC      405    22        0.26
2     VeriFuzz     385    2.6       0.031
3     LibKluzzer   359    90        0.99

Cover-Branches
1     VeriFuzz     1389   630       8.1
2     LibKluzzer   1292   520       5.7
3     Symbiotic    1169   440       5.1

Overall
1     VeriFuzz     1865   640       8.1
2     FuSeBMC      1776   410       4.8
3     LibKluzzer   1738   610       6.7


Score-Based Quantile Functions for Quality Assessment. We use score-based
quantile functions [16] because these visualizations make it easier to
understand the results of the comparative evaluation. The web site¹⁰ and the
results artifact (Table 3) include such a plot for each category; as an
example, we show the plot for category Overall (all test-generation tasks)
in Fig. 5. All 11 test generators participated in category Overall, for
which the quantile plot shows the overall performance over all categories
(scores for meta categories are normalized [3]). A more detailed discussion
of score-based quantile plots for testing is provided in the previous
competition report [6].


Alternative Rankings. Table 8 is similar to Table 7, but contains the alternative 
ranking categories Green Testing and New Test Generators. Column ‘Quality’ 
gives the score in score points (sp), column ‘CPU Time’ the CPU usage in 
hours (h), column ‘CPU Energy’ the CPU usage in kilo-watt-hours (kWh), and 
column ‘Rank Measure’ reports the values for the rank measure, which is different 
for the two alternative ranking categories. (An entry ‘—’ for ‘CPU Energy’ indicates 
that we did not measure the energy consumption for technical reasons.) 


10 https://test-comp.sosy-lab.org/2021/results
Fig. 5: Quantile functions for category Overall. Each quantile function illustrates
the quantile (x-coordinate) of the scores obtained by test-generation runs below
a certain number of test-generation tasks (y-coordinate). More details were given
previously [6]. The graphs are decorated with symbols to make them better
distinguishable without color.


Table 8: Alternative rankings; quality is given in score points (sp), CPU time
in hours (h), energy in kilo-watt-hours (kWh), the first rank measure in
kilo-joule per score point (kJ/sp), and the second rank measure in score
points (sp); measurement values are rounded to 2 significant digits


Rank   Test Generator  Quality  CPU Time  CPU Energy  Rank Measure
                       (sp)     (h)       (kWh)

Green Testing                                         (kJ/sp)
1      TracerX         1315     210       2.5         6.8
2      KLEE            1370     210       2.6         6.8
3      FuSeBMC         1776     410       4.8         9.7
worst                                                 51

New Test Generators                                   (sp)
1      FuSeBMC         1776     410       4.8         1776
2      CMA-ES Fuzz     254      310       -           254


Fig. 6: Number of evaluated test generators for each year (top: number of first-time
participants; bottom: previous year’s participants)


Green Testing: Low Energy Consumption. Since a large part of the cost of
test generation is caused by the energy consumption, it might be important to
also consider the energy efficiency in rankings, as complement to the official
Test-Comp ranking. This alternative ranking category uses the energy
consumption per score point as rank measure:
$\frac{\text{total CPU energy}}{\text{total score}}$, with the unit kilo-joule
per score point (kJ/sp).¹¹ The energy is measured using CPU Energy Meter [17],
which we use as part of BenchExec [16].
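The rank measure can be recomputed from the rounded values in Table 8. The sketch below uses a hypothetical function name and converts kWh to kJ (1 kWh = 3600 kJ); because the inputs are rounded to two significant digits, the results only approximately reproduce the published column.

```python
KJ_PER_KWH = 3600.0  # 1 kWh = 3.6 MJ = 3600 kJ


def energy_per_score_point(cpu_energy_kwh: float, score: float) -> float:
    """Green-Testing rank measure in kJ/sp: total CPU energy / total score."""
    return cpu_energy_kwh * KJ_PER_KWH / score


# (CPU energy in kWh, quality in sp) as listed in Table 8.
TABLE_8 = {"TracerX": (2.5, 1315), "KLEE": (2.6, 1370), "FuSeBMC": (4.8, 1776)}

for tool, (energy, score) in TABLE_8.items():
    print(tool, round(energy_per_score_point(energy, score), 1))
```

A lower value is better, which is why TracerX and KLEE rank ahead of FuSeBMC here despite their lower scores.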


New Test Generators. To acknowledge the test generators that participated for the
first time in Test-Comp, the second alternative ranking category lists measures
only for the new test generators, and the rank measure is the quality with the
unit score point (sp). For example, CMA-ES Fuzz is an early prototype and has
already obtained a total score of 411 points in category Cover-Branches, and
FuSeBMC is a new tool based on some mature components and took second
place already in its first participation. This should encourage developers of test
generators to participate with new tools of any maturity level.


6 Conclusion 


Test-Comp 2021 was the 3rd edition of the Competition on Software Testing,
and attracted 11 participating teams (see Fig. 6 for the participation numbers and
Table 4 for the details). The competition offers an overview of the state of the art in
automatic software testing for C programs. The competition does not only execute
the test generators and collect results, but also validates the achieved coverage
of the test suites, based on the latest version of the test-suite validator TestCov.
As before, the jury and the organizer made sure that the competition follows the
high quality standards of the FASE conference, in particular with respect to the
important principles of fairness, community support, and transparency.


Data Availability Statement. The test-generation tasks and results of the 
competition are published at Zenodo, as described in Table 3. All compo- 
nents and data that are necessary for reproducing the competition are avail- 
able in public version repositories, as specified in Table 2. Furthermore, the 
results are presented online on the competition web site for easy access: 
https://test-comp.sosy-lab.org/2021/results/. 


11 Errata: Table 8 of last year’s report for Test-Comp 2020 contains a typo: The unit of the 
energy consumption per score point is kJ/sp (instead of J/sp). 
Abstract. CoVeriTest, which is integrated in the analysis framework
CPAchecker, adopts verification technology for test-case generation. It
encodes individual test goals as reachability queries, which are then
processed by verifiers. To increase the effectiveness on a broad class of
testing tasks, CoVeriTest leverages the strengths of two different analyses:
an explicit value analysis and predicate abstraction. Similar to
TestComp’20, the two analyses are interleaved and the time duration of an
interleaving segment is calculated dynamically. However, the calculation of
the time duration focuses on the predicted future performance instead of the
past performance, thus rewarding analyses that likely cover open test goals.


Keywords: Test-case generation · Cooperative Verification · CPAchecker


1 Test-Generation Approach 


Generating test cases for a diverse set of tasks like in TestComp is
challenging and often cannot be performed effectively by a single approach.
Therefore, cooperative approaches that combine the strengths of multiple
test-case generators frequently show superior performance as long as they do
not spend too much time in unproductive test-case generators. To avoid
unproductive test-case generation, we equip our CoVeriTest submission with a
novel learning-based scheduler that considers the expected productiveness of
a test-case generator.

CoVeriTest is a hybrid approach based on the concept of cooperative,
verification-based testing [5], which combines complementary verifiers. In
our current instantiation, we iteratively run two verification algorithms,
namely value analysis [4] and predicate analysis [3]. In each iteration, the
analyses continue their exploration until they hit their time limit. The
time limit of an analysis is computed dynamically at the beginning of each
iteration round using our novel learning-based time scheduler. To generate
test cases, we encode the (open) test goals, which are shared between the
analyses, as unreachability queries and let the analyses prove the
unreachability of those goals. A reported counterexample proves the
reachability of a test goal. Therefore, the counterexample is converted into
a test [1] and the test goal is removed from the set of open test goals.

Fig. 1. Our adaptive scheduler integrated in the workflow of CoVeriTest
Time Scheduling. Our time scheduler limits the time per iteration round
to 100 s³ and distributes the 100 s based on the expected contribution of the
individual analyses. The idea is that an analysis gets more time if there exist
more paths to open test goals that the analysis is expected to handle well.
Figure 1 shows the integration of our time scheduler into the CoVeriTest
workflow. First, the scheduler samples a set of syntactical counterexample
paths p, each of which starts at the beginning of the program and ends in an
open test goal. Then, it estimates for each path p the probability
$P(V_i \mid p)$ that analysis i detects p as a real counterexample.⁴ We
estimate the probability $P(V_i \mid p)$ using a unigram language model [9]
in combination with the approach of Richter et al. for the abstraction of
the syntactical paths p. Finally, the scheduler assigns a time budget to
analysis i in proportion to the average probability of detecting a
counterexample path on a testing task T (program plus open test goals):


$$\mathit{limit}_i^T = 10\,\mathrm{s} + 80\,\mathrm{s} \cdot \mathbb{E}_{p \in T}\left[P(V_i \mid p)\right] \tag{1}$$
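Equation (1) can be evaluated with a short sketch. The function name, the analysis names, and the probability values below are invented for illustration; in CoVeriTest the probabilities come from the learned language model. The sketch also assumes that the expectation is taken as the plain mean over the sampled paths.

```python
def time_limits(path_probabilities, base_s=10.0, scale_s=80.0):
    """Compute the per-analysis time limit of Eq. (1).

    path_probabilities maps an analysis name to the estimated probabilities
    P(V_i | p) for the sampled paths p; the expectation is their mean.
    """
    return {
        analysis: base_s + scale_s * sum(probs) / len(probs)
        for analysis, probs in path_probabilities.items()
    }


# Hypothetical estimates for two sampled paths per analysis.
limits = time_limits({
    "value": [0.75, 0.25],
    "predicate": [0.5, 0.5],
})
print(limits)
```

With two analyses, the limits add up to the 100 s round budget exactly when the two expected probabilities sum to 1; the text does not state how this normalization is enforced, so it is left out of the sketch.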


Learning Probability Distribution. The probability distribution
$P(V_i \mid p)$ is unknown. Thus, we aim to learn the distribution. To this
end, we executed the value and predicate analysis separately on the
TestComp’20 category coverage-branches and used the reported
counterexamples, which by construction can be decided by the reporting
analysis, to pre-train our unigram language model [9]. At the beginning of
each CoVeriTest execution, we load the pre-trained model and use the
reported counterexamples to improve it during execution. When the sampled
paths are indecisive, $\mathbb{E}_{p \in T}[P(V_i \mid p)]$ becomes the
normalized progress used in the TestComp’20 strategy [8]. The normalized
progress describes the relative contribution of an analysis to the goals
covered in the last iteration.


3 We choose the same iteration time limit as in TestComp’20 [8], which has been
established by extensive evaluation of CoVeriTest [5].

4 Note that it is not important that p is a real counterexample. We model the
probability that analysis i can decide whether p is a counterexample, rather
than whether p is a counterexample.


2 Tool Architecture 


CoVeriTest is an extension of the software analysis framework CPAchecker
(version 2.0) and is written in Java. For parsing, we use the Eclipse CDT
parser.⁵ For test-case generation, we rely on two instances of CPAchecker’s
test-case generation algorithm, which extracts test cases from
counterexamples [1]. One instance generates test cases based on
CPAchecker’s value analysis [4] and the other instance uses CPAchecker’s
predicate analysis [3]. Both analyses apply counterexample-guided
abstraction refinement [7] and use the SMT solver MathSAT5 [6]. We
interleave the two instances and determine their time slices based on their
expected success on the set of open test goals. To determine the time
slices, we added the adaptive scheduler described in the previous section.

3 Strengths and Weaknesses 


The main difference between CoVeriTest versions in Test-Comp’20 and
Test-Comp’21 is the distribution of the 100 s per round. Our own experiments
with the Test-Comp 2020 benchmark set revealed a small advantage for our new
distribution with respect to the coverage-branches category. Comparing the
competition results against a CoVeriTest configuration using the time
distribution from Test-Comp’20 shows that the new distribution performs
slightly worse in the coverage-error category. In total, 13 errors are
missed, 8 of them in the subcategory Floats. Overall, an advantage of the
new distribution is scarcely noticeable on the Test-Comp 2021 benchmark set.
The unigram language model does not generalize well.

Since the underlying analyses remain the same, CoVeriTest still generates a
small number of test cases. Also, the problems with tasks using large arrays
and the subcategories BusyBox-MemSafety and SQLite-MemSafety remain.
Additionally, CoVeriTest performs poorly on the new ntdrivers tasks and the
new subcategory Combinations. While finding the error in the new
nla-digbench tasks is difficult, covering branches works well for these
tasks. Moreover, CoVeriTest deals well with the new category XCSP and the
remaining new tasks.


4 Setup 


We develop our extension of CoVeriTest in a fork⁶ of CPAchecker and submitted
revision 970d550, which participated in all categories. To run CoVeriTest
on program program.i, one requires a Java 11 runtime environment and must
execute the following command line:


    scripts/cpa.sh -testcomp21 -setprop log.consoleLevel=SEVERE -stats
    -benchmark -heap 10000m -spec property.prp program.i


Note that property.prp is a placeholder for the test specification
(coverage-error-call.prp or coverage-branches.prp). Tests are generated for
programs assuming a 32-bit environment. To support 64-bit environments, one
needs to add the configuration option -64. The generated tests are written to
the folder output/test-suite and adhere to the XML format demanded by the
Test-Comp rules. Additionally, the folder contains the mandatory metadata file.


5 https://www.eclipse.org/cdt

6 https://github.com/cedricrupb/cpachecker
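The invocation described in this section could be assembled programmatically, for example by a wrapper script. The sketch below only builds the argument list and does not start the tool; the function name, the mapping from data model LP64 to option -64, and the position of -64 in the argument list are assumptions of this illustration.

```python
def coveritest_cmdline(program, property_file, data_model="ILP32"):
    """Build the argument list for the command line shown above.

    The options are taken from the text; only their assembly is illustrated.
    """
    cmd = [
        "scripts/cpa.sh", "-testcomp21",
        "-setprop", "log.consoleLevel=SEVERE",
        "-stats", "-benchmark",
        "-heap", "10000m",
    ]
    if data_model == "LP64":
        # Assumed placement: the text only says that -64 must be added.
        cmd.append("-64")
    cmd += ["-spec", property_file, program]
    return cmd


print(" ".join(coveritest_cmdline("program.i", "coverage-branches.prp")))
```

An argument list of this kind can be handed to subprocess.run without shell quoting issues.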


5 Project and Contributors 


CoVeriTest is an extension of the CPAchecker project and is developed as a
joint, open source project between research groups of Paderborn University and
TU Darmstadt. Contributors are Marie-Christine Jakobs and Cedric Richter.
We would also like to thank all developers of CPAchecker.
Abstract. We describe and evaluate a novel white-box fuzzer for C programs
named FuSeBMC, which combines fuzzing and symbolic execution, and applies
Bounded Model Checking (BMC) to find security vulnerabilities in C programs.
FuSeBMC explores and analyzes C programs (1) to find execution paths that
lead to property violations and (2) to incrementally inject labels to guide
the fuzzer and the BMC engine to produce test-cases for code coverage.
FuSeBMC successfully participated in Test-Comp’21 and achieved first place
in the Cover-Error category and second place in the Overall category.


Keywords: Automated Test-Case Generation · Symbolic Execution ·
Bounded Model Checking · Fuzzing · Security.


1 Test Generation Approach 


Automated test-case generation is a method to check whether the software 
matches expected requirements [2]. It involves the automated execution of soft- 
ware components to evaluate intricate properties and achieve code coverage met- 
rics (e.g., decision, branch, instruction). Here, we describe and evaluate a novel 
white-box fuzzer, FuSeBMC, capable of automatically producing test-cases for C 
programs. FuSeBMC provides an innovative software testing framework that de- 
tects security vulnerabilities in C programs by using fuzzing and symbolic execu- 
tion in combination with Bounded Model Checking (BMC) (cf. Fig. 1). FuSeBMC 
builds on top of clang [1] to instrument the C program, uses Map2check [8] as a 
fuzzing engine, and ESBMC (Efficient SMT-based Bounded Model Checker) [4,5] 
as BMC and symbolic execution engines, thus combining dynamic and static ver- 
ification techniques. 
Fig. 1: FuSeBMC: a white-box fuzzer framework for C Programs.


FuSeBMC takes a C program and a test specification [3] as input. In the 
Cover-Error category, FuSeBMC invokes the fuzzing and BMC engines sequen- 
tially to find a path that violates a given property. It uses an iterative BMC 
approach that incrementally unwinds the program until it finds a property vi- 
olation or exhausts time or memory limits. FuSeBMC uses incremental BMC 
to explore the program state space searching for a property violation since all 
programs in Test-Comp’21 are known to have issues. In the Cover-Branches 
category, FuSeBMC explores and analyzes the target C program using the clang 
compiler to inject labels incrementally. FuSeBMC will compute all branches of 
the C code and inject the labels for each branch by adding the label GOAL-N, 
where N is the goal number. Both engines will check whether these injected 
labels are reachable to produce test-cases for branch coverage. 
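
To make the iterative unwinding concrete, the following Python sketch searches input sequences of growing length until an error state is reached or the bound budget runs out. It is a toy under stated assumptions: the transition function, the error predicate, and the explicit enumeration are hypothetical stand-ins, whereas a BMC engine such as ESBMC encodes each unwinding as an SMT formula instead of enumerating inputs.

```python
from itertools import product


def incremental_search(step, is_error, init, inputs, max_bound):
    """Try input sequences of length 1..max_bound; return the first violating one."""
    for bound in range(1, max_bound + 1):        # unwind one step deeper each round
        for seq in product(inputs, repeat=bound):
            state = init
            for value in seq:
                state = step(state, value)
            if is_error(state):
                return bound, list(seq)          # the counterexample doubles as a test-case
    return None                                  # bound budget exhausted, nothing found


# Hypothetical toy program: a counter that errs once it reaches 3.
result = incremental_search(lambda s, v: s + v, lambda s: s == 3, 0, (0, 1), 5)
print(result)
```

Shallow violations are found at small bounds, so the search stops early; a violation that needs more steps than the budget allows is simply not reported.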

FuSeBMC analyzes the counterexamples and saves them as a GraphML file. 
It checks whether the fuzzing and BMC engines could produce counterexamples 
for both categories Cover-Error and Cover-Branches. If that is not the case, 
FuSeBMC employs a second fuzzing engine named selective fuzzer which produces 
test-cases for the rest of the labels. The selective fuzzer produces test-cases by 
learning from the two engines’ output: it analyzes the range of the inputs that 
should be passed to examine the target C program and then produces different 
test-cases. Lastly, FuSeBMC prepares valid test-cases with metadata to test a 
target C program using TestCov [3] as a test validator. 

FuSeBMC sets a 150-second limit for the fuzzing engine, a 700-second 
limit for the BMC engine, and a 50-second limit for the selective fuzzer. 
These numbers were obtained empirically by analyzing the Test-Comp’21 results. 
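
The staging of the three engines and their budgets can be sketched as follows. This is a minimal illustration, not FuSeBMC's implementation: the engine callables, goal names, and return format are hypothetical, and only the three time limits are taken from the text.

```python
BUDGETS_S = {"fuzzer": 150, "bmc": 700, "selective_fuzzer": 50}  # limits stated in the text


def run_engines(goals, fuzzer, bmc, selective, budgets=BUDGETS_S):
    """Run fuzzer and BMC in sequence, then the selective fuzzer on uncovered goals.

    Each engine is a callable (goals, budget_seconds) -> {goal: test_case}.
    """
    covered = {}
    for name, engine in (("fuzzer", fuzzer), ("bmc", bmc)):
        for goal, test in engine(goals, budgets[name]).items():
            covered.setdefault(goal, test)       # keep the first test found per goal
    remaining = [g for g in goals if g not in covered]
    if remaining:                                # fall back only when goals are left
        covered.update(selective(remaining, budgets["selective_fuzzer"]))
    return covered


# Hypothetical stub engines, for illustration only.
fuzz = lambda goals, t: {g: f"fuzz:{g}" for g in goals if g == "g1"}
bmc = lambda goals, t: {g: f"bmc:{g}" for g in goals if g in ("g1", "g2")}
sel = lambda goals, t: {g: f"sel:{g}" for g in goals}

tests = run_engines(["g1", "g2", "g3"], fuzz, bmc, sel)
total_budget = sum(BUDGETS_S.values())
```

With these limits the three stages add up to 900 seconds per task.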


2 Strengths and Weaknesses 


Incremental BMC allows FuSeBMC to keep unwinding the program until a prop- 
erty violation is found or time or memory limits are exhausted. This approach is 
advantageous in the Cover-Error category as finding one error is the primary 
goal. Another strength of FuSeBMC is that it can accurately model C programs 
that use the IEEE floating-point arithmetic [6,7]. The floating-point encoding 
layer in our BMC engine extends the support for the SMT FP theory to solvers 
that do not support it natively. FuSeBMC can test programs with floating-point 
arithmetic using all solvers currently supported by the BMC engine (ESBMC), 
including Boolector [9], which does not support the SMT FP theory natively. 

In both the Cover-Error and Cover-Branches categories, various test-cases 
produced by FuSeBMC are validated successfully. The majority of our test-cases 
were produced by the BMC engine and the selective fuzzer; our fuzzing engine 
did not produce many test-cases because it does not model the C library, so it 
mostly guesses the inputs. For example, in the Cover-Error category, TestCov 
confirms 500 test-cases produced by FuSeBMC: 13 from our fuzzing engine 
(Map2Check), 393 from the BMC engine (ESBMC), and 94 from our selective 
fuzzer. 
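
As a quick check of these figures, the short computation below derives each engine's share of the confirmed Cover-Error test-cases. Only the three counts come from the text; the dictionary layout is an illustrative choice.

```python
# Confirmed Cover-Error test-cases per engine, as reported in the text.
confirmed = {"Map2Check": 13, "ESBMC": 393, "selective fuzzer": 94}

total = sum(confirmed.values())
shares = {name: round(100 * n / total, 1) for name, n in confirmed.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

The counts sum to the reported 500, with the BMC engine contributing roughly four out of five confirmed test-cases.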

However, note that our fuzzing engine is not limited to producing test-cases. 
It helps our selective fuzzer by providing information about the number of 
inputs required to trigger a property violation, i.e., the number of assignments 
required to reach an error. In several cases, the BMC engine can exhaust the time 
limit before providing such information, e.g., when there are large arrays that 
need to be initialized at the beginning of the program. For example, consider 
the following code fragment extracted from the standard_copy1_ground-2.c 
benchmark, as illustrated in Fig. 2. 


    #define N 100000
    ...
    int a, a1[N], a2[N];
    for (a = 0; a < N; a++) {
      a1[a] = __VERIFIER_nondet_int();
      a2[a] = __VERIFIER_nondet_int();
    }
    ...
    for (int x = 0; x < N; x++)
      __VERIFIER_assert(a1[x] == a2[x]);


Fig. 2: Code fragment that contains a large array. 


In this particular example, ESBMC exhausts the time limit before checking 
the assertion a1[x] == a2[x]. Apart from that, our verification engines 
also show some weakness in producing test-cases due 
to the many optimizations we perform when converting the program to SMT. 
In particular, two techniques affected the test-case generation significantly: con- 
stant folding and slicing. Constant folding evaluates constants (which includes 
nondeterministic symbols) and propagates them throughout the formula during 
encoding, and slicing removes expressions that are not on the path to a property 
violation. These two techniques can significantly reduce SMT solving time. 
However, they can remove the expressions required to trigger a violation when the 
program is compiled, i.e., variable initialization might be optimized away, forcing 
FuSeBMC to generate a test-case with undefined behavior. 

Regarding our fuzzing engine, we identified a limitation in handling programs 
with pointer dereferences. The fuzzing engine keeps track of variables throughout 
the program but has issues identifying when they go out of scope. When we try 
to generate a test-case that triggers a pointer dereference, our fuzzing engine 
provides trash values, and the selective fuzzer might create test-cases that do 
not reach the error. 


3 Tool Setup and Configuration 


In order to run our fusebmc.py script,⁵ one must set the architecture (i.e., 32 or 
64-bit), the competition strategy (i.e., k-induction, falsification, or incremental 
BMC), the property file path, and the benchmark path, as: 


fusebmc.py [-a {32, 64}] [-p PROPERTY_FILE] 
[-s {kinduction,falsi,incr,fixed}] 
[BENCHMARK_PATH] 


where -a sets the architecture, -p sets the property file path, and -s sets 
the strategy (e.g., kinduction, falsi, incr, or fixed). For Test-Comp’21, 
FuSeBMC uses incr for incremental BMC. 
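
For scripting, the invocation above can be assembled as an argument list. The sketch below is an assumption-laden helper, not part of FuSeBMC: the function name, the `./fusebmc.py` location, and the file names are hypothetical, while the flags and their allowed values are those listed in the usage line.

```python
STRATEGIES = ("kinduction", "falsi", "incr", "fixed")  # values from the usage line


def fusebmc_command(benchmark, property_file, arch, strategy="incr"):
    """Build an argv list for fusebmc.py using only the documented flags."""
    if arch not in (32, 64):
        raise ValueError("arch must be 32 or 64")
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    return ["./fusebmc.py", "-a", str(arch), "-p", property_file,
            "-s", strategy, benchmark]


# Hypothetical file names, for illustration only.
cmd = fusebmc_command("bench.c", "coverage-error-call.prp", arch=32)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)`; the default strategy mirrors the `incr` setting used for Test-Comp'21.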

When choosing the fuzzing engine, we set the following options when execut- 
ing Map2Check: timeout of 150 seconds for Map2Check in Cover-Error, and a 
timeout of 70 seconds in Cover-Branches; --fuzzer-mb 1000 limits memory to 
1000 MB; --target-function-name reach_error defines the function name 
to be searched; --target-function checks whether the target-function is reach- 
able; --nondet-generator fuzzer uses only fuzzing; --generate-witness sets 
the witness output path. 

By choosing incremental BMC, the following options are set when executing 
ESBMC: --no-div-by-zero-check disables the division by zero check (required 
by Test-Comp); --force-malloc-success sets that all dynamic allocations suc- 
ceed (a Test-Comp requirement); --floatbv enables floating-point SMT encod- 
ing; --incremental-bmc enables incremental BMC; --unlimited-k-steps re- 
moves the upper limit of iteration steps for incremental BMC; --witness-output 
sets the witness output path; --no-bounds-check and --no-pointer-check 
disable bounds and pointer-safety checks, resp., since we are only 
interested in finding reachability bugs; --k-step 5 sets the incremental BMC step to 5; 
--no-allign-check disables pointer alignment checks; and --no-slice disables 
slicing of unnecessary instructions. 

The BenchExec tool-info module is named fusebmc.py and the benchmark 
definition file is FuSeBMC.xml. 


5 https://gitlab.com/sosy-lab/test-comp/archives-2021/-/blob/master/2021/FuSeBMC.zip 




4 Software Project 


The FuSeBMC source code is written in C++ and is available for download 
on GitHub, which includes the latest release, FuSeBMC v3.6.6. FuSeBMC is 
publicly available under the terms of the MIT License. Instructions for building 
FuSeBMC from the source code are given in the file README.md (including the 
description of all dependencies). 
Abstract. The setup of SYMBIOTIC 8 for Test-Comp 2021 brings radical 
changes in the test generation for the coverage-branches property. As 
in SYMBIOTIC 7, we generate tests by running our fork of the symbolic 
executor KLEE on the analyzed program. SYMBIOTIC 8, however, runs 
several instances of KLEE in parallel. We run one instance of KLEE on 
the original program and, simultaneously, we create one (intentionally 
unsound) program slice for every program-terminating instruction in the 
program and run KLEE on these slices. Apart from this principal change, 
we also improved other components of the tool, mainly the program 
slicer. Further, our fork of KLEE now supports symbolic pointer 
arithmetic and comparison of symbolic addresses. 


1 Test-Generation Approach 


SYMBIOTIC [3,2] is an open-source program analysis framework that combines 
static analyses with code transformations in order to enable faster analysis of 
the code. In the setup for Test-Comp 2021, SYMBIOTIC uses program slicing [6] 
in combination with symbolic execution [5]. 

Static (backward) program slicing [6] is a technique that removes program 
instructions that have no influence on reachability or the effect of selected parts 
of the program. In Test-Comp, we use program slicing for all properties. For 
the coverage-error-call property, we slice the program to remove instructions 
that cannot affect reachability of the error location. For the coverage-branches 
property, we use program slicing to create modified versions of the program on 
which we are likely to quickly generate tests that reach hard-to-cover parts of 
the program. 

Symbolic execution [5] is a program analysis technique that enumerates all 
possible execution paths of a program. For every path, it computes its corre- 
sponding path condition, which is a collection of constraints on program inputs 
that forms the necessary and sufficient condition to follow the path. Each path 
condition is then used to create a test that makes the program execute the given 
path. 
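
The idea of turning path conditions into tests can be illustrated with a small Python sketch. Everything in it is a simplification made for illustration: the two branch predicates form a hypothetical toy program, and the brute-force `solve` function stands in for the SMT solver a symbolic executor such as KLEE would query.

```python
from itertools import product

# Hypothetical toy program with two branches over one integer input x:
#   if (x < 0) ...;  if (x % 2 == 0) ...
BRANCHES = [lambda x: x < 0, lambda x: x % 2 == 0]


def path_conditions(branches):
    """Yield one path condition per combination of branch outcomes."""
    for outcomes in product((True, False), repeat=len(branches)):
        yield list(zip(branches, outcomes))


def solve(path, domain=range(-4, 5)):
    """Stand-in for an SMT solver: find an input satisfying every constraint."""
    for x in domain:
        if all(pred(x) == taken for pred, taken in path):
            return x
    return None  # no witness in the searched domain


tests = [solve(path) for path in path_conditions(BRANCHES)]
print(tests)
```

Each returned input drives the toy program down exactly one path, so four inputs cover its four paths. With n independent branches the number of path conditions grows as 2^n, which is the path explosion problem in miniature.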


1.1 Workflow of Symbiotic 8 


The workflow of SYMBIOTIC 8 in Test-Comp 2021 for the property 
coverage-error-call is the same as in SYMBIOTIC 7: we slice the analyzed 
program with respect to calls of the error function and run KLEE on this sliced 
program. If KLEE finds a feasible path that calls the error function, we attempt 
to replay this path in the unsliced program to fill in the possibly missing values 
returned from calls to functions __VERIFIER_nondet_* that may have been sliced 
away. 

The workflow for the property coverage-branches changed significantly in 
SYMBIOTIC 8. For this property, we run several instances of KLEE in parallel: 
one instance on the original program and other instances on slices generated for 
every terminating location in the program. 

More precisely, we create a pool of processes that keeps at most 
8 processes running at the same time (on a first-come, first-served basis). We start an 
instance of KLEE on the original program and add it to the pool. Then we identify 
instructions in the program that terminate the execution (further referred to as 
targets). For each target, we create a slice and queue a run of KLEE on this slice. 
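
A bounded pool of this kind can be sketched with Python's standard library. The sketch below is not SYMBIOTIC's scheduler: it uses threads and a stub in place of real KLEE processes, and the file names and the `run_klee_stub` helper are hypothetical. Only the pool size of 8 and the submission order (original program first, then the slices) follow the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8  # pool size stated in the text


def run_klee_stub(program):
    """Hypothetical stand-in for launching one KLEE instance on `program`."""
    return [f"{program}-test0"]


def schedule(original, slices):
    """Queue the original program first, then one job per slice."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {name: pool.submit(run_klee_stub, name)
                   for name in [original, *slices]}
        return {name: future.result() for name, future in futures.items()}


results = schedule("original.bc", [f"slice{i}.bc" for i in range(10)])
print(len(results))
```

Jobs beyond the eighth wait in the executor's queue and start in submission order as workers become free.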

These slices are unsound in the sense that they do not preserve all execution 
paths to the targets. A slice is constructed in two steps: 


1. We gather all instructions that are backward-reachable from the target in 
the target’s function and recursively in the callers of the target’s function. 
However, we move only up the call stack and do not descend into called procedures 
during this process. 

2. After we gather all such instructions, we replace all other instructions with a 
call to abort and apply standard program slicing with respect to the target. 
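
The first step can be sketched on an explicit control-flow graph. The Python code below is an illustrative model only: the three functions, their node names, and the dictionary-based graph encoding are hypothetical, and the subsequent standard slicing of step 2 is not modelled.

```python
def backward_reachable(cfg, target):
    """Nodes of one function's CFG from which `target` can be reached."""
    preds = {}
    for node, succs in cfg.items():
        for succ in succs:
            preds.setdefault(succ, set()).add(node)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(preds.get(node, ()))
    return seen


def gather(cfgs, call_sites, function, target):
    """Step 1: collect instructions, moving only up the call stack."""
    kept, done, work = set(), set(), [(function, target)]
    while work:
        fn, node = work.pop()
        if (fn, node) in done:
            continue
        done.add((fn, node))
        kept |= {(fn, n) for n in backward_reachable(cfgs[fn], node)}
        work.extend(call_sites.get(fn, []))      # continue from each call site of fn
    return kept


# Hypothetical program: main calls detour at m1 and helper at m4; m3 aborts.
cfgs = {
    "helper": {"h1": ["h2"], "h2": []},
    "detour": {"d1": []},
    "main": {"m1": ["m2"], "m2": ["m3", "m4"], "m3": [], "m4": ["m5"], "m5": []},
}
call_sites = {"helper": [("main", "m4")]}        # callee -> list of (caller, node)

kept = gather(cfgs, call_sites, "helper", "h2")
all_nodes = {(fn, n) for fn, cfg in cfgs.items() for n in cfg}
replaced_by_abort = all_nodes - kept
```

In this model the body of `detour` is never entered, so it lands in `replaced_by_abort` together with the nodes of `main` from which the target cannot be reached.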


For example, consider the code on the left in Figure 1. It contains three possi- 
ble targets, namely error () (line 7), abort () (line 13), and return 0 (line 17). 
If we slice with respect to the target error(), we start searching the program 
backwards from this target and get all instructions in the body of function foo. 
Then we move up to the call of foo on line 16 and collect all instructions of function 
main except the call to abort (from which the call to foo is unreachable). All 
instructions except the gathered ones are replaced with a call to abort. Standard 
program slicing then produces the program depicted in the middle in Figure 1 (in 
this case, it just removes the return). The slice for the target abort () preserves 
only the first three lines of main as depicted on the right in Figure 1. 

Whenever the main instance of KLEE finishes test generation, we have tests 
for all feasible execution paths of the program. Therefore, we kill all other run- 
ning instances of KLEE and discard tests that were not generated by the main 
instance to reduce the size of the test suite. If the main instance does not finish 
before timeout, we keep all generated tests. 
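
This selection rule is small enough to state as code. The sketch below assumes a hypothetical mapping from instance names to the tests they generated; the function name and test labels are illustrative.

```python
def select_tests(tests_by_instance, main_finished, main="main"):
    """Keep only the main instance's tests if it finished, else keep everything."""
    if main_finished:
        return list(tests_by_instance[main])
    return [t for tests in tests_by_instance.values() for t in tests]


# Hypothetical outputs of three KLEE instances.
generated = {"main": ["t1", "t2"], "slice-a": ["t3"], "slice-b": ["t4"]}

suite_complete = select_tests(generated, main_finished=True)
suite_timeout = select_tests(generated, main_finished=False)
```

Discarding the slice-generated tests is safe in the first case because the main instance has already covered every feasible path.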

Using the unsound slices aims only to help reach hard-to-cover places in 
the program. In particular, potentially expensive detours are replaced by abort 
and symbolic execution thus does not waste resources to discover them (see 
line 2 in the middle in Figure 1). The current construction of unsound slices 




 1  int inc(int x) {        int inc(int x) {
 2    return x + 1;           abort();
 3  }                       }
 4
 5  void foo(int x) {       void foo(int x) {
 6    if (x > 0)              if (x > 0)
 7      error();                error();
 8  }                       }
 9
10  int main() {            int main() {            int main() {
11    int y = nondet();       int y = nondet();       int y = nondet();
12    if (y < 0)              if (y < 0)              if (y < 0)
13      abort();                abort();                abort();
14    if (y == 0)             if (y == 0)
15      y = inc(y);             y = inc(y);
16    foo(y);                 foo(y);
17    return 0;
18  }                       }                       }


Fig. 1. An example of a program (left) and its unsound slices with respect to the calls 
of error() (middle) and abort() (right). 


guarantees that if a test covers a target in the corresponding slice, then it covers 
the same target also in the original program. The opposite implication does not 
hold due to the unsoundness. Note that tests generated from the slices may not 
and usually do not cover all branches in the original program, therefore we still 
need to run KLEE on the original program. 


2 Software Architecture 


All parts of SYMBIOTIC 8 use LLVM 10 [7]. We compile the analyzed program 
into LLVM bitcode by the compiler CLANG. 

To carry out symbolic execution, we use our fork of the open-source sym- 
bolic executor KLEE [1]. The fork has several modifications compared to the 
mainstream KLEE. The main modification is the representation of pointers as 
segment-offset pairs, which enables symbolic-sized allocations. As of this year, our 
fork of KLEE also supports comparison of and arithmetic on symbolic pointers. 
We use Z3 [4] as the SMT solver in KLEE. The components of SYMBIOTIC are 
programmed in C++ and the scripts that schedule and control running these 
components are written in Python. 


3 Strengths and Weaknesses 


Although symbolic execution is very good at generating test-cases, it suffers from 
the path explosion problem. This problem emerges on programs that contain 
many branching instructions or loops with the number of iterations dependent on 
the input and may hinder symbolic execution from exploring “deep” parts of the 
program. Using unsound program slices for terminating instructions attempts to 
alleviate this problem. Although the slice is not guaranteed to preserve paths to 
the target for which it was created, there are programs where this technique helps 
symbolic execution to cover substantially more instructions. However, there are 
also many cases where the technique worsens the coverage. 


Fig. 2. The coverage achieved by SYMBIOTIC 8 and 7 on individual benchmarks of the 
Cover-Branches category 

Figure 2 illustrates the overall positive and negative effect of this approach. 
The scatter plot on the left compares the coverage achieved by SYMBIOTIC 8 
and the coverage achieved by SYMBIOTIC 7 on individual benchmarks that were 
used in both Test-Comp 2020 and 2021.¹ The scatter plot shows that the 
behavior of the tool changes dramatically. To summarize the data, we compute 
the difference between the two coverages on each benchmark (for example, if 
SYMBIOTIC 8 achieves 80% and SYMBIOTIC 7 60% coverage, the difference is 
+20%). The histogram on the right indicates that the overall effect of unsound 
slices is positive, as the distribution is skewed to positive values. Indeed, 
SYMBIOTIC 8 took 3rd place in the category Cover-Branches (corresponding to 
the coverage-branches property) in Test-Comp 2021, which is a big improvement 
over the previous Test-Comp, where SYMBIOTIC was 8th out of 9 participants 
in this category. 
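
The per-benchmark difference used for the histogram can be computed as below. The benchmark names and coverage values are made up for illustration (they are not Test-Comp data); only the rule "SYMBIOTIC 8 coverage minus SYMBIOTIC 7 coverage, on benchmarks present in both years" comes from the text.

```python
def coverage_differences(cov8, cov7):
    """Coverage of version 8 minus version 7, on benchmarks present in both runs."""
    return {b: cov8[b] - cov7[b] for b in cov8.keys() & cov7.keys()}


# Hypothetical branch coverage in percent.
cov8 = {"bench-a": 80, "bench-b": 40}
cov7 = {"bench-a": 60, "bench-b": 55, "bench-c": 10}

diffs = coverage_differences(cov8, cov7)
improved = sum(1 for d in diffs.values() if d > 0)
```

Here `bench-a` reproduces the +20 example from the text, `bench-b` shows a regression, and `bench-c` is ignored because it was not run with both versions.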

The workflow of SYMBIOTIC on coverage-error-call did not change from 
last year and thus the results are similar. 


1 The use of unsound slices is not the only difference between SYMBIOTIC 8 and 7, but 
we believe that it has the biggest impact on the presented results. 


4 Tool Setup and Configuration 


The archive is available at https://doi.org/10.5281/zenodo.4491729. Run 
SYMBIOTIC with the following command: 


bin/symbiotic --test-comp --prp <prpfile> [--32] <source> 


where --prp sets the verified property and --32 tells SYMBIOTIC to assume 
32-bit architecture (64-bit architecture is assumed by default). The generated 
test-cases are stored in the directory test-suite. 
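
When driving the tool from a script, the command line can be built as an argument list. The helper below is a hypothetical convenience wrapper, not part of SYMBIOTIC; the file names are placeholders, and only the flags shown in the command above are used.

```python
def symbiotic_command(property_file, source, bits32=False):
    """Build an argv list for bin/symbiotic using only the documented options."""
    cmd = ["bin/symbiotic", "--test-comp", "--prp", property_file]
    if bits32:
        cmd.append("--32")        # 64-bit architecture is assumed by default
    cmd.append(source)
    return cmd


# Hypothetical file names, for illustration only.
cmd64 = symbiotic_command("coverage-branches.prp", "prog.c")
cmd32 = symbiotic_command("coverage-branches.prp", "prog.c", bits32=True)
print(" ".join(cmd32))
```

Either list can be passed to `subprocess.run`; after the run, the generated tests would be looked up in the test-suite directory mentioned above.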


5 Software Project and Contributors 


SYMBIOTIC 8, as it competes in Test-Comp 2021, has been developed by Marek 
Chalupa and Jakub Novák under the supervision of Jan Strejček. The tool and 
its components are available under MIT License. LLVM, KLEE, and Z3 are also 
available under open-source licenses. The project web page is: 


https://github.com/staticafi/symbiotic 